I was playing around with different implementations of the Euclidean distance metric and noticed that I get different results for SciPy, pure Python, and Java.
Here's how I compute the distance using SciPy (option 1):
distance = scipy.spatial.distance.euclidean(sample, training_vector)
Here's an implementation in Python that I found in a forum (option 2):
distance = math.sqrt(sum([(a - b) ** 2 for a, b in zip(training_vector, sample)]))
And lastly, here's my implementation in Java (option 3):
public double distance(int[] a, int[] b) {
    assert a.length == b.length;
    // Accumulate the sum of squared element-wise differences.
    double squaredDistance = 0.0;
    for (int i = 0; i < a.length; i++) {
        squaredDistance += Math.pow(a[i] - b[i], 2.0);
    }
    return Math.sqrt(squaredDistance);
}
Both sample and training_vector are 1-D arrays with length 784, taken from the MNIST dataset. I tried all three methods with the same sample and training_vector. The problem is that the three different methods result in three significantly different distances (that is, around 1936 for option 1, 1914 for option 2, and 1382 for option 3). Interestingly, when I use the same argument order for sample and training_vector in options 1 and 2 (i.e. flip the arguments to option 1 around), I get the same result for these two options. But distance metrics are supposed to be symmetrical, right...?
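To illustrate how argument order could matter at all, here is a minimal pure-Python sketch. It assumes the vectors hold unsigned 8-bit values (MNIST pixels are in the range 0..255) and that a subtraction on such values wraps around modulo 256 instead of going negative. The helper names and the tiny three-element vectors are made up for illustration, and the sketch only models wraparound in the subtraction step; the real NumPy/SciPy behavior may differ in its details.

```python
import math

def wrap_uint8(value):
    """Mimic unsigned 8-bit arithmetic by reducing the value modulo 256 (assumption)."""
    return value % 256

def euclidean_wrapped(u, v):
    """Euclidean distance where each difference wraps like an unsigned byte."""
    return math.sqrt(sum(wrap_uint8(a - b) ** 2 for a, b in zip(u, v)))

def euclidean_exact(u, v):
    """Euclidean distance with ordinary (non-wrapping) integer arithmetic."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Hypothetical stand-ins for two MNIST vectors.
sample = [0, 200, 13]
training_vector = [5, 100, 13]

forward = euclidean_wrapped(sample, training_vector)
backward = euclidean_wrapped(training_vector, sample)
exact = euclidean_exact(sample, training_vector)

print(forward, backward, exact)
```

With wrapping differences, the two argument orders give different values and neither matches the exact distance, which is at least consistent with the asymmetry I observed.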
What's also interesting: I'm using these metrics for a k-NN classifier for the MNIST dataset. My Java implementation yields an accuracy of around 94% for 100 test samples and 2700 training samples. However, the Python implementation using option 1 only yields an accuracy of about 75%...
Do you have any ideas as to why I'm getting these different results? If you are interested, I can post a CSV for two arrays online, and post a link here.
I'm using Java 8, Python 2.7, and SciPy 1.0.0.
Edit: Changed option 2 to
distance = math.sqrt(sum([(float(a) - float(b)) ** 2 for a, b in zip(training_vector, sample)]))
This had the following effects:
- it got rid of a ubyte overflow warning (I must have missed this warning before...)
- changing the argument order for options 1 and 2 no longer makes a difference.
- the results for options 2 (pure Python) and 3 (Java) are now equal
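The float conversion from the edit can be sketched as a small standalone helper. The name euclidean_float, the length check, and the example vectors are my own additions for illustration, not part of any library.

```python
import math

def euclidean_float(u, v):
    """Euclidean distance that converts every element to float before subtracting."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((float(a) - float(b)) ** 2 for a, b in zip(u, v)))

# Hypothetical stand-ins for two MNIST vectors.
sample = [0, 200, 13]
training_vector = [5, 100, 13]

d_forward = euclidean_float(sample, training_vector)
d_backward = euclidean_float(training_vector, sample)

print(d_forward, d_backward)
```

Because the differences are computed as floats, a negative difference stays negative and squares to the same value in either argument order. I assume that converting NumPy arrays up front, e.g. with sample.astype(float), would have a similar effect, but I have not verified that here.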
So, this only leaves the following problem: why is the result different (i.e. wrong?) when using SciPy?
(In other words, option 2 now uses float(a) - float(b) instead of a - b.)