spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Gylfi <>
Subject Re: K Nearest Neighbours
Date Sat, 18 Jul 2015 10:18:57 GMT

What I would do in your case would be something like this.. 

Lets call the two datasets, qs and ds, where qs is an array of vectors and
ds is an RDD[(dsID: Long, Vector)]. 

Do the following: 
1) create a k-NN class that can keep track of the k-Nearest Neighbors so
far. It must have a qsID and some structure for the k nearest neighbors
Seq[(dsID:Long, Distance: Long)]  and the function .add( nn : (Long, Vector)
) that will do the distance calc and update the kNN when appropriate.  
2) collect the qs and key-it as well, so each qs has an ID, i.e. qs =
Array[(qsID : Long, Vector)]

Now what you want to do is not create all the distance stuff, but just the
k-NNs. To do this we will actually create a few k-NN for each query vector,
one for each partition, and then merge them later. 

3) do a ds.mapPartition() and inside the function you create a k-NN for the
each qs, scan the ds points of the partition and output an iterator pointing
to the set of k-NNs created. 
val k = 100
val qs = new Array[(KNNClass)]()
val ds = RDD[(Long, Vector)]() 
val knnResults = ds.mapPartitions( itr => {
      val knns = qp =>  (qp._1, new KNNClass(k, qp) )
      itr.foreach( dp => {
        knns.foreach( knn => knn.add( dp ))
      } )

Now you have one k-NN per partition for each query point, but this we can
simply fix by doing a reduceByKey and merge all the k-NNs for each qpID into
a single k-NN. 

val knnResultFinal = knnResults.reduceByKey( (a, b) => KNNClass.merge( a, b)

Where you have a static function that merges the two k-NNs, i.e. we simply
concatenate them and sort on distance, and then take the k top values and
returns them as a new knn class. 

If you want to control how many k-NNs are create you can always repartition

How does that sound? Does this make any sense?   :) 


View this message in context:
Sent from the Apache Spark User List mailing list archive at

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message