mahout-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Sean Owen <sro...@gmail.com>
Subject Re: Generic approach to kNN
Date Tue, 11 Oct 2011 08:18:08 GMT
It doesn't sound like you really have or want a recommender problem,
then. Really, this is neither a clustering nor recommender problem;
it's just using a similarity metric.

Ted's right that you do have to encode these things as numeric vectors
to use anything in Mahout. A vector can't have values like "cat" --
how far is that from 2 or "dog"? You have to encode that first. There
are utilities to help encode categorical data though.

But then you could easily make use of the bits from the recommender
code if you wanted to find nearest neighbor, or reuse a bit of
clustering on Hadoop that calculates similarities. (In fact there are
bits of the recommender on Hadoop that calculate similarities too --
plenty of pieces you could co-opt to just create similarities to one
item.)

If you want to write a similarity function that works on what are
really Map<?,?> objects, you could easily do that. You're probably not
really using Mahout anymore at that stage. But it's simple enough that
you could do it. After writing the similarity metric, the rest of the
code is one for-loop.

On Tue, Oct 11, 2011 at 6:57 AM, Felix Filozov <ffilozov@gmail.com> wrote:
> Ted, does this apply to recommenders?
>
> Let me describe my problem more simply: Imagine you have a set of N feature
> vectors, and you're given a vector X (not in the set of N), and you're asked
> to find a vector in N which is nearest to X. I believe this is a classic
> description of NN.
>
> I've been making my way through Mahout in Action (I just realized you guys
> are the authors; great book!) and some online tutorials, and it seems to me
> that I'd have to do quite a lot of shoehorning to achieve my goal. I don't
> really have a notion of a user or a rating, similarity would have to be
> defined by me, and any further optimization using Kd-trees or LSH could be
> difficult.
>
>
> On Mon, Oct 10, 2011 at 9:25 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
>
>> You need to encode these as numerical vectors.
>>
>> The classes in org.apache.mahout.vectorizer.encoders can help converting
>> combined numerical, categorical and textual fields into a coherent vector
>> that can be used with standard distance measures.
>>
>> On Mon, Oct 10, 2011 at 11:54 PM, Felix Filozov <ffilozov@gmail.com>
>> wrote:
>>
>> > I have a set of feature vectors. They're composed of integers and other
>> > non-numerical values. This means that I would need the ability to supply
>> my
>> > own distance function. My data has no notion of users, just vectors.
>> >
>> > Example:
>> >
>> > vector 1: (1, apple, dog, 34, 8766)
>> > ...
>> > vector n: (3, orange, cat, 3738, 3737)
>> >
>> > I would like to know if Mahout can perform kNN similarity search using
>> such
>> > arbitrary items/vectors. As a side question, can it  perform that outside
>> > the context of a recommender? I think reducing some problems to a
>> > recommendation may a bit awkward.
>> >
>> >
>> >
>> > On Monday, October 10, 2011, Sean Owen <srowen@gmail.com> wrote:
>> > > I think there are a lot of answers to this, depending on what exactly
>> > > you want. This is just one answer -- maybe you can clarify your
>> > > requirements.
>> > >
>> > > You want to just find the k most similar items, and you want to
>> > > construe this as a recommender problem?
>> > > The item-based recommenders have a mostSimilarItems() method. All it
>> > > does is find the k most similar items to the given item. It's just
>> > > applying a given similarity metric to search all possibilities. It
>> > > works on "items" but you can flip it around to work on users if you
>> > > like.
>> > >
>> > > Vectors really have to take on numeric values, or else they're not
>> > > really vectors! Are you trying to map discrete values to some numeric
>> > > range?
>> > >
>> > >
>> > > On Mon, Oct 10, 2011 at 8:26 PM, Felix Filozov <ffilozov@gmail.com>
>> > wrote:
>> > >> I would like perform a kNN similarity search, where each data point
is
>> a
>> > N
>> > >> dimensional vector and each coordinate in the vector may take on any
>> > value
>> > >> (reals or strings). It seems to me that Mahout doesn't have the
>> ability
>> > to
>> > >> perform a generic kNN similarity search, instead the problem has to
be
>> > >> mapped to a recommender. Is Mahout the right tool for this task?
>> > >>
>> > >> If it is, how have you dealt with the mapping, and if not, what would
>> > you
>> > >> recommend?
>> > >>
>> > >> Thanks.
>> > >>
>> > >> Felix
>> > >>
>> > >
>> >
>>
>

Mime
View raw message