8

I need to generate a data frame holding the minimum Euclidean distance between each row of one data frame and all rows of another data frame. Both of my data frames are large (approx. 40,000 rows). This is what I have worked out so far:

x <- matrix(c(3, 6, 3, 4, 8), nrow = 5, ncol = 7, byrow = TRUE)
y <- matrix(c(1, 4, 4, 1, 9), nrow = 5, ncol = 7, byrow = TRUE)

sed.dist <- numeric(5)
for (i in seq_along(sed.dist)) {
  sed.dist[i] <- sqrt(sum((y[i, 1:7] - x[i, 1:7])^2))
}

But this only compares row i of y with row i of x (that is, it only works when i = j). What I need is to take each row of the "y" data frame in turn (y[1, 1:7], then y[2, 1:7], and so on up to i = 5) and compute its Euclidean distance to every row of the "x" data frame (x[j, 1:7]). For each row i of y, I then need the minimum of those distances, stored in another data frame.

2
  • This sqrt(colSums((y[1, ] - t(x))^2)) computes the distance from row 1 of y to every row of x. You want the min of this, repeated for every other row of y?
    – alexis_laz
    Mar 6, 2014 at 18:10
  • Yes, that's what I want.
    – user14845
    Mar 7, 2014 at 3:56

2 Answers 2

5

Try this:

apply(y,1,function(y) min(apply(x,1,function(x,y)dist(rbind(x,y)),y)))
# [1] 5.196152 5.385165 4.898979 4.898979 5.385165

Working from the inside out, we bind a row of x to a row of y and calculate the distance between them using the dist(...) function (written in C). We do this for a given row of y with each row of x in turn, using the inner apply(...), and then find the minimum of the result. Then we do this for each row of y in the outer call to apply(...).
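For readers who find the nested apply(...) hard to follow, the same steps can be sketched as two explicit loops. This is only an illustration of the logic described above, not a faster alternative; the names min.dist and d are introduced here for the example.

```r
x <- matrix(c(3, 6, 3, 4, 8), nrow = 5, ncol = 7, byrow = TRUE)
y <- matrix(c(1, 4, 4, 1, 9), nrow = 5, ncol = 7, byrow = TRUE)

min.dist <- numeric(nrow(y))          # one result per row of y
for (i in seq_len(nrow(y))) {         # outer apply: each row of y
  d <- numeric(nrow(x))
  for (j in seq_len(nrow(x))) {       # inner apply: each row of x
    # dist() on a two-row matrix returns the single distance between the rows
    d[j] <- as.numeric(dist(rbind(x[j, ], y[i, ])))
  }
  min.dist[i] <- min(d)               # keep only the smallest distance
}
min.dist
# [1] 5.196152 5.385165 4.898979 4.898979 5.385165
```

Each call to dist(...) here handles just one pair of rows, so the number of calls grows with nrow(x) * nrow(y).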

1
  • Thank you very much. It worked perfectly fine but took very long to run. Nevertheless, thanks for the help.
    – user14845
    Mar 10, 2014 at 9:10
3

Expanding on my comment on the question, a pretty fast approach would be the following, although with 40,000 rows you'll have to wait a bit, I guess:

unlist(lapply(seq_len(nrow(y)), function(i) min(sqrt(colSums((y[i, ] - t(x))^2)))))
#[1] 5.196152 5.385165 4.898979 4.898979 5.385165
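To see why the one-liner works, it may help to look at a single row of y. The sketch below relies on R's recycling rule: a vector subtracted from a matrix is recycled down the columns, so subtracting from t(x) lines y[i, ] up against each row of x. The names i, diffs and d are introduced only for this illustration.

```r
x <- matrix(c(3, 6, 3, 4, 8), nrow = 5, ncol = 7, byrow = TRUE)
y <- matrix(c(1, 4, 4, 1, 9), nrow = 5, ncol = 7, byrow = TRUE)

i <- 1
# t(x) is 7 x 5, so each of its columns is one row of x.
# y[i, ] has length 7 and is recycled down every column.
diffs <- y[i, ] - t(x)                 # column j holds y[i, ] - x[j, ]
stopifnot(all(diffs[, 2] == y[i, ] - x[2, ]))

d <- sqrt(colSums(diffs^2))            # distance from y[i, ] to each row of x
min(d)
# [1] 5.196152
```

All distances for one row of y come out of a single vectorised expression, with no per-pair function call.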

And a benchmark comparing the two approaches:

x = matrix(runif(1e2*5), 1e2)
y = matrix(runif(1e2*5), 1e2)
library(microbenchmark)
alex = function() unlist(lapply(seq_len(nrow(y)), 
                           function(i) min(sqrt(colSums((y[i, ] - t(x))^2)))))
jlhoward = function() apply(y,1,function(y)
                                  min(apply(x,1,function(x,y)dist(rbind(x,y)),y)))
all.equal(alex(), jlhoward())
#[1] TRUE
microbenchmark(alex(), jlhoward(), times = 20)
#Unit: milliseconds
#       expr        min         lq     median         uq        max neval
#     alex()   3.369188   3.479011   3.600354   4.513114   4.789592    20
# jlhoward() 422.198621 431.565643 436.561057 442.643181 602.929742    20
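If this is still too slow at 40,000 rows, one further option (not part of either answer and not benchmarked here) is to swap the per-row computation for matrix algebra, using the identity |a - b|^2 = |a|^2 + |b|^2 - 2 a.b. The sketch below processes y in blocks so that the full nrow(y) x nrow(x) distance matrix is never held in memory at once. The function name min_dist_chunked and the chunk.size default are hypothetical choices for illustration.

```r
# Hypothetical helper: minimum distance from each row of y to any row of x.
# chunk.size is an assumed tuning knob; each block needs roughly
# chunk.size * nrow(x) doubles of working memory.
min_dist_chunked <- function(x, y, chunk.size = 1000) {
  if (nrow(y) == 0) return(numeric(0))
  x.sq <- rowSums(x^2)                          # |x[j, ]|^2 for every row of x
  out <- numeric(nrow(y))
  for (s in seq(1, nrow(y), by = chunk.size)) {
    idx <- s:min(s + chunk.size - 1, nrow(y))
    yb <- y[idx, , drop = FALSE]
    # d2[k, j] = |yb[k, ]|^2 + |x[j, ]|^2 - 2 * sum(yb[k, ] * x[j, ])
    d2 <- outer(rowSums(yb^2), x.sq, "+") - 2 * tcrossprod(yb, x)
    # rounding can leave tiny negative values, so clamp at zero before sqrt
    out[idx] <- sqrt(pmax(apply(d2, 1, min), 0))
  }
  out
}

x <- matrix(c(3, 6, 3, 4, 8), nrow = 5, ncol = 7, byrow = TRUE)
y <- matrix(c(1, 4, 4, 1, 9), nrow = 5, ncol = 7, byrow = TRUE)
min_dist_chunked(x, y, chunk.size = 2)
# [1] 5.196152 5.385165 4.898979 4.898979 5.385165
```

The expanded form can lose a little precision to floating-point cancellation when two rows are nearly identical, so it is worth checking the result against the colSums version on a sample of rows before trusting it on the full data.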