+1 to all of the above esp. Dimensionality reduction and locality sensitive hashing / min
hashing.
There's also an algorithm implemented in MLlib called DIMSUM which was developed at Twitter
for this purpose. I've been meaning to try it and would be interested to hear about results
you get.
https://blog.twitter.com/2014/allpairssimilarityviadimsum
Charlie
—
Sent from Mailbox
On Wednesday, Aug 26, 2015 at 09:57, Michael Malak <michaelmalak@yahoo.com.invalid>,
wrote:
Yes. And a paper that describes using grids (actually varying grids) is http://research.microsoft.com/enus/um/people/jingdw/pubs%5CCVPR12GraphConstruction.pdf In
the Spark GraphX In Action book that Robin East and I are writing, we implement a drastically
simplified version of this in chapter 7, which should become available in the MEAP midSeptember. http://www.manning.com/books/sparkgraphxinaction
If you don't want to compute all N^2 similarities, you need to implement some kind of blocking
first. For example, LSH (locally sensitive hashing). A quick search gave this link to a Spark
implementation:
http://stackoverflow.com/questions/27718888/sparkimplementationforlocalitysensitivehashing
On Wed, Aug 26, 2015 at 7:35 AM, Jaonary Rabarisoa <jaonary@gmail.com> wrote:
Dear all,
I'm trying to find an efficient way to build a kNN graph for a large dataset. Precisely,
I have a large set of high dimensional vector (say d >>> 10000) and I want to build
a graph where those high dimensional points are the vertices and each one is linked to the
knearest neighbor based on some kind similarity defined on the vertex spaces.
My problem is to implement an efficient algorithm to compute the weight matrix of the graph.
I need to compute a N*N similarities and the only way I know is to use "cartesian" operation
follow by "map" operation on RDD. But, this is very slow when the N is large. Is there a more
cleaver way to do this for an arbitrary similarity function ?
Cheers,
Jao
