mahout-user mailing list archives

From Josh Patterson <j...@cloudera.com>
Subject Re: Generic approach to kNN
Date Thu, 13 Oct 2011 14:45:17 GMT
If you want to keep it all in memory and you have under, say, a GB or
so of data, you could just use Weka's BallTree.

Here is an example of where I used it before:

https://openpdc.svn.codeplex.com/svn/Hadoop/Current%20Version/src/TVA/Hadoop/MapReduce/Datamining/Weka/WekaUtils.java

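I believe the average kNN search time on a BallTree is roughly O(log N)
per query for low-dimensional data (it degrades toward a linear scan as
the number of dimensions grows), which is handy as well.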
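
To illustrate why a ball tree can skip most of the data during a kNN
query, here is a minimal in-memory sketch in plain Python. It is not
Weka's BallTree and does not use Weka's API: the `Ball` class, the `knn`
function, the `leaf_size` parameter, and the "split at the median of the
widest dimension" rule are all assumptions chosen for brevity.

```python
import heapq
import math


def dist(a, b):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


class Ball:
    """One node of a ball tree: a center, a radius, and children or points."""

    def __init__(self, points, leaf_size=8):
        dims = len(points[0])
        n = len(points)
        self.center = tuple(sum(p[d] for p in points) / n for d in range(dims))
        self.radius = max(dist(self.center, p) for p in points)
        self.points = None
        self.left = self.right = None
        if n <= leaf_size:
            self.points = points  # leaf: keep the raw points
            return
        # Assumed heuristic: split at the median of the widest dimension.
        spread = [max(p[d] for p in points) - min(p[d] for p in points)
                  for d in range(dims)]
        axis = spread.index(max(spread))
        ordered = sorted(points, key=lambda p: p[axis])
        mid = n // 2
        self.left = Ball(ordered[:mid], leaf_size)
        self.right = Ball(ordered[mid:], leaf_size)


def knn(root, query, k):
    """Return the k nearest (distance, point) pairs, closest first."""
    heap = []  # max-heap on distance, stored as (-distance, point)

    def visit(node):
        # No point in this ball can be closer than this lower bound.
        bound = dist(query, node.center) - node.radius
        if len(heap) == k and bound >= -heap[0][0]:
            return  # prune the whole ball
        if node.points is not None:
            for p in node.points:
                d = dist(query, p)
                if len(heap) < k:
                    heapq.heappush(heap, (-d, p))
                elif d < -heap[0][0]:
                    heapq.heapreplace(heap, (-d, p))
            return
        # Visit the closer child first so the pruning bound tightens early.
        for child in sorted((node.left, node.right),
                            key=lambda c: dist(query, c.center)):
            visit(child)

    visit(root)
    return sorted((-neg_d, p) for neg_d, p in heap)


# Usage: 100 points on a 10 x 10 grid, 3 nearest neighbours of one query.
points = [(float(x), float(y)) for x in range(10) for y in range(10)]
query = (2.2, 6.9)
nearest = knn(Ball(points), query, 3)
```

The pruning test relies only on the triangle inequality, so the same
structure should work with any true metric, not just Euclidean distance.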

JP

On Thu, Oct 13, 2011 at 1:32 AM, Felix Filozov <ffilozov@gmail.com> wrote:
> I decided that since I don't have as much data as I thought I would have, I
> would simply choose an optimized data structure to hold my data set, which
> I'd query locally.
>
> I did start looking into distributed NN options and ways of parallelizing NN
> as well.
>
> Thanks for all the help.
>
> On Wed, Oct 12, 2011 at 11:26 PM, Josh Patterson <josh@cloudera.com> wrote:
>
>> Without knowing a lot about what you are doing, I'd say you could just
>> do this rather simply as Sean has said with a basic similarity
>> function.
>>
>> The really simple "batch" version of this might be:
>>
>> 1. Define a similarity (distance) function.
>> 2. Take as input a "base point / instance" to search against.
>> 3. On the map side of the MR job, score each input vector against the
>> base point with the distance function.
>> 4. Output using the total order partitioner, sorting on distance score.
>> 5. Read the first k entries from the front of the sorted output.
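
The steps above can be sketched locally in plain Python, without Hadoop.
This is only an illustration of the data flow: `mixed_distance`,
`map_score`, and `top_k` are hypothetical names, the sample records are
made up, and `heapq.nsmallest` stands in for the sorted, partitioned
output of a real MR job. The distance function is one possible choice for
vectors that mix reals and strings (absolute difference for numbers,
0/1 mismatch for strings) and assumes the numeric coordinates are already
on comparable scales.

```python
import heapq


def mixed_distance(a, b):
    """Step 1: distance over coordinates that may be reals or strings."""
    total = 0.0
    for x, y in zip(a, b):
        if isinstance(x, str) or isinstance(y, str):
            total += 0.0 if x == y else 1.0  # string mismatch costs 1
        else:
            total += abs(x - y)  # assumes pre-scaled numeric values
    return total


def map_score(records, base):
    """Step 3: the 'map side' emits (distance, record id) per input vector."""
    for rec_id, vec in records:
        yield mixed_distance(base, vec), rec_id


def top_k(scored, k):
    """Steps 4 and 5: order by distance and keep the first k entries."""
    return heapq.nsmallest(k, scored)


# Step 2: the base point to search against, plus a few sample records.
base = (1.0, "red")
records = [
    ("a", (1.0, "red")),
    ("b", (1.5, "red")),
    ("c", (1.0, "blue")),
    ("d", (9.0, "red")),
]
nearest = top_k(map_score(records, base), 2)
print(nearest)  # [(0.0, 'a'), (0.5, 'b')]
```

In an actual job the scoring would run in the mappers and the ordering
would come from the shuffle and sort, so only the head of the first
output partition needs to be read.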
>>
>> A more complicated option might be something along the lines of MD-HBase:
>>
>> http://www.cs.ucsb.edu/~sudipto/papers/md-hbase.pdf
>>
>> where they store a k-d tree index in HBase to give relatively
>> low-latency kNN search queries.
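
For reference, the k-d tree search itself can be sketched in memory as
follows. This does not reflect how the paper lays the index out in HBase;
`build_kdtree`, `kd_knn`, and the dict-based node layout are assumptions
made for illustration, and the sample points are made up.

```python
import heapq


def sq_dist(a, b):
    """Squared Euclidean distance (enough for ranking neighbours)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_kdtree(points, depth=0):
    """Recursively split on one axis per level, at the median point."""
    if not points:
        return None
    axis = depth % len(points[0])
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2
    return {
        "point": ordered[mid],
        "axis": axis,
        "left": build_kdtree(ordered[:mid], depth + 1),
        "right": build_kdtree(ordered[mid + 1:], depth + 1),
    }


def kd_knn(node, query, k, heap=None):
    """Collect the k nearest points in a max-heap of (-sq_dist, point)."""
    if heap is None:
        heap = []
    if node is None:
        return heap
    d = sq_dist(query, node["point"])
    if len(heap) < k:
        heapq.heappush(heap, (-d, node["point"]))
    elif d < -heap[0][0]:
        heapq.heapreplace(heap, (-d, node["point"]))
    # Descend first into the side of the splitting plane holding the query.
    diff = query[node["axis"]] - node["point"][node["axis"]]
    if diff < 0:
        near, far = node["left"], node["right"]
    else:
        near, far = node["right"], node["left"]
    kd_knn(near, query, k, heap)
    # Cross the plane only if it is closer than the current k-th best.
    if len(heap) < k or diff * diff < -heap[0][0]:
        kd_knn(far, query, k, heap)
    return heap


kd_points = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
kd_query = (9, 2)
kd_nearest = sorted((-neg_d, p)
                    for neg_d, p in kd_knn(build_kdtree(kd_points), kd_query, 2))
print(kd_nearest)  # [(2, (8, 1)), (4, (7, 2))]
```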
>>
>> The batch version seems like it might be a nice place to start.
>>
>> Hope this helps,
>>
>> JP
>>
>>
>> On Mon, Oct 10, 2011 at 3:26 PM, Felix Filozov <ffilozov@gmail.com> wrote:
>> > I would like to perform a kNN similarity search, where each data point is
>> > an N-dimensional vector and each coordinate in the vector may take on any
>> > value (reals or strings). It seems to me that Mahout doesn't have the
>> > ability to perform a generic kNN similarity search; instead, the problem
>> > has to be mapped to a recommender. Is Mahout the right tool for this task?
>> >
>> > If it is, how have you dealt with the mapping? If not, what would you
>> > recommend?
>> >
>> > Thanks.
>> >
>> > Felix
>> >
>>
>>
>>
>> --
>> Twitter: @jpatanooga
>> Solution Architect @ Cloudera
>> hadoop: http://www.cloudera.com
>>
>



-- 
Twitter: @jpatanooga
Solution Architect @ Cloudera
hadoop: http://www.cloudera.com
