
I was playing around with different implementations of the Euclidean distance metric and I noticed that I get different results for SciPy, pure Python, and Java.

Here's how I compute the distance using SciPy (option 1):

distance = scipy.spatial.distance.euclidean(sample, training_vector)

Here's a pure-Python implementation I found in a forum (option 2):

distance = math.sqrt(sum([(a - b) ** 2 for a, b in zip(training_vector, sample)]))

And lastly, here's my implementation in Java (option 3):

public double distance(int[] a, int[] b) {
    assert a.length == b.length;
    double squaredDistance = 0.0;
    for(int i=0; i<a.length; i++){
        squaredDistance += Math.pow(a[i] - b[i], 2.0);
    }
    return Math.sqrt(squaredDistance);
}

Both sample and training_vector are 1-D arrays with length 784, taken from the MNIST dataset. I tried all three methods with the same sample and training_vector. The problem is that the three different methods result in three significantly different distances (that is, around 1936 for option 1, 1914 for option 2, and 1382 for option 3). Interestingly, when I use the same argument order for sample and training_vector in options 1 and 2 (i.e. flip the arguments to option 1 around), I get the same result for these two options. But distance metrics are supposed to be symmetrical, right...?

What's also interesting: I'm using these metrics for a k-NN classifier for the MNIST dataset. My Java implementation yields an accuracy of around 94% for 100 test samples and 2700 training samples. However, the Python implementation using option 1 only yields an accuracy of about 75%...

Do you have any ideas as to why I'm getting these different results? If you're interested, I can post a CSV of the two arrays online and link it here.

I'm using Java 8, Python 2.7, and SciPy 1.0.0.

Edit: Changed option 2 to

distance = math.sqrt(sum([(float(a) - float(b)) ** 2 for a, b in zip(training_vector, sample)]))

This had the following effects:

  • it got rid of a ubyte overflow warning (I must have missed this warning before...; see the snippet after this list)
  • changing the argument order for options 1 and 2 no longer makes a difference.
  • the results for options 2 (pure Python) and 3 (Java) are now equal
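
For reference, this is the kind of wraparound the warning was pointing at (a minimal, made-up sketch; the exact warning text may differ between NumPy versions):

import numpy as np

a = np.uint8(3)
b = np.uint8(5)

# Subtracting unsigned 8-bit scalars wraps modulo 256 and emits something like
# "RuntimeWarning: overflow encountered in ubyte_scalars":
print(a - b)                  # 254 instead of -2

# Casting to float first gives the intended difference:
print(float(a) - float(b))    # -2.0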

So, this only leaves the following problem: why is the result different (i.e. wrong?) when using SciPy?

  • I read what you wrote, but print len(sample) and len(training_vector).
    – Matt Timmermans
    Feb 28, 2018 at 13:17
  • I can't reproduce your problem.
    – IMCoins
    Feb 28, 2018 at 13:18
  • @MattTimmermans it returns 784 for both vectors.
    Feb 28, 2018 at 13:29
  • In option 2, try float(a) - float(b) instead of a - b.
    – Matt Timmermans
    Feb 28, 2018 at 13:32
  • @MattTimmermans thanks, that solved a big part of the problem! Now only the SciPy solution is different (see edit in original question).
    Feb 28, 2018 at 13:39

1 Answer


Okay, I found the solution: I had imported both the training and test datasets using pandas with dtype=np.uint8. Consequently, sample and training_vector were both numpy arrays of type uint8. I changed the data type to np.float32 and now all three options give the same result. I also tried np.uint32 and it works as well.
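
In case it helps anyone, the fix on the loading side looks roughly like this (a sketch; the file names and CSV layout are placeholders for my actual loading code):

import numpy as np
import pandas as pd

# Read the pixel values as float32 instead of uint8 so later subtractions
# cannot wrap around (file names are just placeholders):
training_data = pd.read_csv("mnist_train.csv", header=None, dtype=np.float32).values
test_data = pd.read_csv("mnist_test.csv", header=None, dtype=np.float32).values

# Equivalent alternative: keep the original loading code and cast afterwards,
# e.g. training_data = training_data.astype(np.float32)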

I'm not entirely sure why, but SciPy apparently doesn't give the "expected" result when working with uint8. My best guess is that SciPy subtracts the two input arrays elementwise before taking the norm, and with unsigned 8-bit integers that subtraction wraps around modulo 256 instead of producing negative differences, which silently skews the distance. In any case, it works now. Thanks to everyone who helped!
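
Here's a minimal sketch of that suspicion, assuming SciPy subtracts the inputs in their original dtype before taking the norm (the two tiny vectors are made up just to show the effect):

import numpy as np
from scipy.spatial import distance

u = np.array([0, 10], dtype=np.uint8)
v = np.array([255, 0], dtype=np.uint8)

# Elementwise subtraction wraps for unsigned 8-bit integers:
# 0 - 255 becomes 1 instead of -255.
print(u - v)                                      # [ 1 10]

# If the same wraparound happens inside euclidean(), the distance comes out wrong:
print(distance.euclidean(u, v))                   # ~10.05
print(distance.euclidean(u.astype(np.float64),
                         v.astype(np.float64)))   # ~255.20, the expected value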

  • Thanks for this! I had the same problem and was stuck for hours trying to figure out what was wrong with my code, since all my formulas were correct. Just to add for anyone who might need it: numpy.linalg.norm has the same problem.
    – iamnobody
    Sep 15, 2018 at 16:01
