
I was playing around with different implementations of the Euclidean distance metric and I noticed that I get different results for SciPy, pure Python, and Java.

Here's how I compute the distance using SciPy (option 1):

distance = scipy.spatial.distance.euclidean(sample, training_vector)

Here's a pure-Python implementation I found in a forum (option 2):

distance = math.sqrt(sum([(a - b) ** 2 for a, b in zip(training_vector, sample)]))

And lastly, here's my implementation in Java (option 3):

public double distance(int[] a, int[] b) {
    assert a.length == b.length;
    double squaredDistance = 0.0;
    for(int i=0; i<a.length; i++){
        squaredDistance += Math.pow(a[i] - b[i], 2.0);
    }
    return Math.sqrt(squaredDistance);
}

Both sample and training_vector are 1-D arrays with length 784, taken from the MNIST dataset. I tried all three methods with the same sample and training_vector. The problem is that the three different methods result in three significantly different distances (that is, around 1936 for option 1, 1914 for option 2, and 1382 for option 3). Interestingly, when I use the same argument order for sample and training_vector in options 1 and 2 (i.e. flip the arguments to option 1 around), I get the same result for these two options. But distance metrics are supposed to be symmetrical, right...?

What's also interesting: I'm using these metrics for a k-NN classifier for the MNIST dataset. My Java implementation yields an accuracy of around 94% for 100 test samples and 2700 training samples. However, the Python implementation using option 1 only yields an accuracy of about 75%...

Do you have any ideas as to why I'm getting these different results? If you're interested, I can post a CSV of the two arrays online and link it here.

I'm using Java 8, Python 2.7, and SciPy 1.0.0.

Edit: Changed option 2 to

distance = math.sqrt(sum([(float(a) - float(b)) ** 2 for a, b in zip(training_vector, sample)]))

This had the following effects:

  • it got rid of a ubyte overflow warning (I must have missed this warning before...; see the snippet after this list)
  • changing the argument order for options 1 and 2 no longer makes a difference.
  • the results for options 2 (pure Python) and 3 (Java) are now equal
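
For reference, this is the kind of wraparound the warning was pointing at (a minimal, made-up sketch; the exact warning text may differ between NumPy versions):

import numpy as np

a = np.uint8(3)
b = np.uint8(5)

# Subtracting unsigned 8-bit scalars wraps modulo 256 and emits something like
# "RuntimeWarning: overflow encountered in ubyte_scalars":
print(a - b)                  # 254 instead of -2

# Casting to float first gives the intended difference:
print(float(a) - float(b))    # -2.0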

So, this only leaves the following problem: why is the result different (i.e. wrong?) when using SciPy?

  • I read what you wrote, but print len(sample) and len(training_vector).
    – Matt Timmermans
    Feb 28, 2018 at 13:17
  • I can't reproduce your problem.
    – IMCoins
    Feb 28, 2018 at 13:18
  • @MattTimmermans it returns 784 for both vectors.
    Feb 28, 2018 at 13:29
  • In option 2, try float(a) - float(b) instead of a - b.
    – Matt Timmermans
    Feb 28, 2018 at 13:32
  • @MattTimmermans thanks, that solved a big part of the problem! Now only the SciPy solution is different (see edit in original question).
    Feb 28, 2018 at 13:39

1 Answer


Okay, I found the solution: I had imported both the training and test datasets using pandas with dtype=np.uint8. Consequently, sample and training_vector were both numpy arrays of type uint8. I changed the data type to np.float32 and now all three options give the same result. I also tried np.uint32 and it works as well.
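
In case it helps anyone, the fix on the loading side looks roughly like this (a sketch; the file names and CSV layout are placeholders for my actual loading code):

import numpy as np
import pandas as pd

# Read the pixel values as float32 instead of uint8 so later subtractions
# cannot wrap around (file names are just placeholders):
training_data = pd.read_csv("mnist_train.csv", header=None, dtype=np.float32).values
test_data = pd.read_csv("mnist_test.csv", header=None, dtype=np.float32).values

# Equivalent alternative: keep the original loading code and cast afterwards,
# e.g. training_data = training_data.astype(np.float32)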

I'm not entirely sure why, but SciPy apparently doesn't give the "expected" result when working with uint8. My best guess is that SciPy subtracts the two input arrays elementwise before taking the norm, and with unsigned 8-bit integers that subtraction wraps around modulo 256 instead of producing negative differences, which silently skews the distance. In any case, it works now. Thanks to everyone who helped!
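
Here's a minimal sketch of that suspicion, assuming SciPy subtracts the inputs in their original dtype before taking the norm (the two tiny vectors are made up just to show the effect):

import numpy as np
from scipy.spatial import distance

u = np.array([0, 10], dtype=np.uint8)
v = np.array([255, 0], dtype=np.uint8)

# Elementwise subtraction wraps for unsigned 8-bit integers:
# 0 - 255 becomes 1 instead of -255.
print(u - v)                                      # [ 1 10]

# If the same wraparound happens inside euclidean(), the distance comes out wrong:
print(distance.euclidean(u, v))                   # ~10.05
print(distance.euclidean(u.astype(np.float64),
                         v.astype(np.float64)))   # ~255.20, the expected value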

  • Thanks for this! I had the same problem and was stuck for hours trying to figure out what was wrong with my code, since all my formulas were correct. Just to add for anyone who might need it: numpy.linalg.norm has the same problem.
    – iamnobody
    Sep 15, 2018 at 16:01
