spark-user mailing list archives

From Jaonary Rabarisoa <jaon...@gmail.com>
Subject Re: Build k-NN graph for large dataset
Date Thu, 27 Aug 2015 06:16:37 GMT
Thank you all for these links. I'll check them.

On Wed, Aug 26, 2015 at 5:05 PM, Charlie Hack <charles.t.hack@gmail.com>
wrote:

> +1 to all of the above, especially dimensionality reduction and
> locality-sensitive hashing / min-hashing.
>
> There's also an algorithm implemented in MLlib called DIMSUM which was
> developed at Twitter for this purpose. I've been meaning to try it and
> would be interested to hear about results you get.
>
> https://blog.twitter.com/2014/all-pairs-similarity-via-dimsum
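>
> A minimal sketch of how that is exposed in MLlib (not from this thread;
> the toy data and the spark-shell `sc` are assumptions): DIMSUM sits
> behind RowMatrix.columnSimilarities, which estimates cosine similarities
> between the matrix's *columns*, so each data point has to be laid out as
> a column.
>
> import org.apache.spark.mllib.linalg.Vectors
> import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, RowMatrix}
>
> // Toy data: each Vector is a row of the matrix.
> val rows = sc.parallelize(Seq(
>   Vectors.dense(1.0, 0.0, 2.0),
>   Vectors.dense(0.0, 1.0, 1.0),
>   Vectors.dense(3.0, 1.0, 0.0)
> ))
> val mat = new RowMatrix(rows)
>
> // The threshold trades accuracy for speed: column pairs whose estimated
> // cosine similarity is below it may be dropped (0.1 is an arbitrary pick).
> val sims: CoordinateMatrix = mat.columnSimilarities(0.1)
> sims.entries.collect().foreach(println)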
>
> Charlie
>
>
> — Sent from Mailbox
>
> On Wednesday, Aug 26, 2015 at 09:57, Michael Malak <
> michaelmalak@yahoo.com.invalid> wrote:
>
>> Yes. And a paper that describes using grids (actually varying grids) is
>> http://research.microsoft.com/en-us/um/people/jingdw/pubs%5CCVPR12-GraphConstruction.pdf
>> In the Spark GraphX In Action book that Robin East and I are writing, we
>> implement a drastically simplified version of this in chapter 7, which
>> should become available in the MEAP mid-September.
>> http://www.manning.com/books/spark-graphx-in-action
>>
>>
>> ------------------------------
>>
>> If you don't want to compute all N^2 similarities, you need to implement
>> some kind of blocking first. For example, LSH (locality-sensitive hashing).
>> A quick search gave this link to a Spark implementation:
>>
>>
>> http://stackoverflow.com/questions/27718888/spark-implementation-for-locality-sensitive-hashing
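>>
>> To make the blocking idea concrete, here is a minimal
>> sign-of-random-projection LSH sketch (not from the linked answer; the
>> input RDD `vectors` is an assumption): points that hash to the same
>> bucket become candidate neighbors, so pairwise similarities are only
>> computed inside each bucket rather than across all N^2 pairs.
>>
>> import scala.util.Random
>>
>> val d = 10000       // vector dimensionality, as in the question below
>> val numPlanes = 16  // hash bits; more planes mean smaller buckets
>>
>> val rng = new Random(42)
>> val planes = Array.fill(numPlanes, d)(rng.nextGaussian())
>>
>> // One bit per hyperplane: which side of the plane the point falls on.
>> def signature(v: Array[Double]): Int =
>>   (0 until numPlanes).foldLeft(0) { (sig, i) =>
>>     var dot = 0.0
>>     var j = 0
>>     while (j < d) { dot += planes(i)(j) * v(j); j += 1 }
>>     if (dot >= 0) sig | (1 << i) else sig
>>   }
>>
>> def cosine(a: Array[Double], b: Array[Double]): Double = {
>>   val dot = a.zip(b).map { case (x, y) => x * y }.sum
>>   dot / (math.sqrt(a.map(x => x * x).sum) * math.sqrt(b.map(x => x * x).sum))
>> }
>>
>> // vectors: RDD[(Long, Array[Double])] of (id, point) -- assumed input.
>> val candidates = vectors
>>   .map { case (id, v) => (signature(v), (id, v)) }
>>   .groupByKey()
>>   .flatMap { case (_, pts) =>
>>     val arr = pts.toArray
>>     for {
>>       i <- arr.indices.iterator
>>       j <- Iterator.range(i + 1, arr.length)
>>     } yield ((arr(i)._1, arr(j)._1), cosine(arr(i)._2, arr(j)._2))
>>   }
>>
>> (In practice you would hash with several independent sets of planes and
>> union the candidate pairs, since a single table misses neighbors that
>> fall on opposite sides of a plane.)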
>>
>>
>>
>> On Wed, Aug 26, 2015 at 7:35 AM, Jaonary Rabarisoa <jaonary@gmail.com>
>> wrote:
>>
>> Dear all,
>>
>> I'm trying to find an efficient way to build a k-NN graph for a large
>> dataset. Precisely, I have a large set of high-dimensional vectors (say
>> d > 10000) and I want to build a graph where those high-dimensional
>> points are the vertices and each one is linked to its k nearest
>> neighbors based on some kind of similarity defined on the vertex space.
>> My problem is to implement an efficient algorithm to compute the weight
>> matrix of the graph. I need to compute N*N similarities, and the only
>> way I know is to use the "cartesian" operation followed by a "map"
>> operation on RDDs. But this is very slow when N is large. Is there a
>> more clever way to do this for an arbitrary similarity function?
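>>
>> For reference, the cartesian-plus-map approach described above looks
>> roughly like this (a sketch, with `vectors` and `sim` as placeholder
>> names); cartesian materializes all N^2 pairs, which is exactly what
>> makes it slow for large N:
>>
>> // vectors: RDD[(Long, Array[Double])]; sim: any similarity function.
>> val weights = vectors.cartesian(vectors)
>>   .filter { case ((i, _), (j, _)) => i < j }  // skip self and duplicate pairs
>>   .map { case ((i, u), (j, v)) => ((i, j), sim(u, v)) }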
>>
>> Cheers,
>>
>> Jao
