I could try to make the similarity computation work on the rows of a
DistributedRowMatrix with several similarity metrics (similar to
o.a.m.cf.taste.hadoop.similarity.item.ItemSimilarityJob) and would
concentrate on implementing it as a mathematical operation that is not
specific to any domain.
It would then be up to the users to convert their documents to vectors,
and perhaps to apply things like stemming or stopword removal to reduce the
computation overhead, when they use this job for text documents.
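
To illustrate the idea of a domain-agnostic row similarity with pluggable
metrics, here is a minimal in-memory sketch in plain Java. It is not the
Mahout API: the class RowSimilaritySketch, the RowSimilarity interface and
the pairwise method are hypothetical names made up for this example, and a
real job would run as MapReduce over the rows of a DistributedRowMatrix
instead of looping over a double[][].

```java
public class RowSimilaritySketch {

    /** Hypothetical interface: any measure that compares two row vectors. */
    public interface RowSimilarity {
        double similarity(double[] a, double[] b);
    }

    /** Cosine of the angle between two rows; 0 if either row is all zeros. */
    public static final RowSimilarity COSINE = (a, b) -> {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0 || normB == 0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    };

    /** Tanimoto (Jaccard) coefficient on the non-zero pattern of two rows. */
    public static final RowSimilarity TANIMOTO = (a, b) -> {
        int both = 0, either = 0;
        for (int i = 0; i < a.length; i++) {
            boolean inA = a[i] != 0, inB = b[i] != 0;
            if (inA && inB) both++;
            if (inA || inB) either++;
        }
        return either == 0 ? 0.0 : (double) both / either;
    };

    /** Computes the symmetric matrix of similarities between all pairs of rows. */
    public static double[][] pairwise(double[][] rows, RowSimilarity measure) {
        int n = rows.length;
        double[][] result = new double[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = i; j < n; j++) {
                double s = measure.similarity(rows[i], rows[j]);
                result[i][j] = s;
                result[j][i] = s; // similarity is symmetric
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Rows can be documents (term frequencies) or items (user preferences).
        double[][] rows = {
            {1, 0, 2},
            {2, 0, 4},
            {0, 3, 0}
        };
        double[][] cosine = pairwise(rows, COSINE);
        double[][] tanimoto = pairwise(rows, TANIMOTO);
        System.out.println("cosine(row0, row1)   = " + cosine[0][1]);
        System.out.println("tanimoto(row0, row2) = " + tanimoto[0][2]);
    }
}
```

The rows carry no domain meaning here, so the same call works whether a row
is a document's term frequencies or an item's user preferences; only the
vectorization step in front of it differs.
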
I could only start working on that two weeks from now, though. Tell me if
that's welcome and I'll go and create a JIRA issue :)
sebastian
2010/6/9 Jake Mannix <jake.mannix@gmail.com>
> On Tue, Jun 8, 2010 at 4:45 PM, Sebastian Schelter
> <ssc.open@googlemail.com> wrote:
>
> > The relation between these two problems (document similarity and item
> > similarity in CF) is exactly like Sean pointed out: In the paper a
> > document is a vector of term frequencies and the paper shows how to
> > compute the pairwise similarities between those. To use this for
> > collaborative filtering you actually just have to replace the document
> > with an item which is a vector of user preferences.
> >
>
> Yep, a vector is a vector is a vector. (And when you're me, even if you
> are *not* a vector, you might be a vector. ;) )
>
>
> > It shouldn't be too hard to make this work on a DistributedRowMatrix
> > too, I think. You already mentioned you wanna have it that way some time
> > in MAHOUT-362 :)
> >
>
> Well indeed I did!
>
> jake
>
