mahout-user mailing list archives

From Sebastian Schelter <ssc.o...@googlemail.com>
Subject Re: Generating a Document Similarity Matrix
Date Tue, 08 Jun 2010 23:45:17 GMT
The relation between these two problems (document similarity and item
similarity in CF) is exactly as Sean pointed out: in the paper, a document
is a vector of term frequencies, and the paper shows how to compute the
pairwise similarities between those vectors. To use this for collaborative
filtering, you just have to replace the document with an item, which is a
vector of user preferences.
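
To make the idea concrete, here is a minimal in-memory sketch in Python of the two phases as I understand them from the paper: build postings lists per term, then let each term contribute a partial product to every pair of documents that share it. This is not the Mahout code and not a MapReduce job; the function name `pairwise_similarities` and the sample data are made up for illustration, and the score is a plain dot product of raw weights without any normalization.

```python
from collections import defaultdict
from itertools import combinations


def pairwise_similarities(vectors):
    """Return {(a, b): dot product} for every pair of vectors sharing a key.

    `vectors` maps an id (document or item) to a sparse vector given as
    {term_or_user: weight}. Hypothetical helper, for illustration only.
    """
    # Phase 1 (indexing): term -> list of (id, weight) postings.
    postings = defaultdict(list)
    for vec_id, weights in vectors.items():
        for key, weight in weights.items():
            postings[key].append((vec_id, weight))

    # Phase 2 (pairwise): each term adds w(t, a) * w(t, b) to every pair
    # (a, b) in its postings list. Pairs with no shared term never appear.
    sims = defaultdict(float)
    for entries in postings.values():
        # Sorting by id keeps each pair key in a canonical (a < b) order.
        for (a, wa), (b, wb) in combinations(sorted(entries), 2):
            sims[(a, b)] += wa * wb
    return dict(sims)


# Documents as vectors of term frequencies (made-up sample data).
docs = {
    "d1": {"mahout": 2, "lucene": 1},
    "d2": {"mahout": 1, "solr": 3},
    "d3": {"lucene": 2, "solr": 1},
}
print(pairwise_similarities(docs))
```

For the collaborative filtering case the same function applies unchanged: pass items as the ids and {user: preference} as the sparse vectors. Only pairs that share at least one term (or user) are ever touched, which is what keeps the approach cheap on sparse data.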

It shouldn't be too hard to make this work on a DistributedRowMatrix too, I
think. You already mentioned in MAHOUT-362 that you want to have it that
way at some point :)

-sebastian

2010/6/9 Jake Mannix <jake.mannix@gmail.com>

> Ah yes.  I would love for us to have an implementation of that pairwise
> similarity code.  It would be useful for lots of things in Mahout, yes!
>
>  -jake
>
> On Tue, Jun 8, 2010 at 4:21 PM, Sebastian Schelter
> <ssc.open@googlemail.com>wrote:
>
> > I did not mean to say that you can use the item-item similarity code from
> > CF for computing the document similarities. I just wanted to point out
> > that these problems are closely related, and that the paper the CF code
> > is based on deals with the computation of pairwise document similarities
> > and could therefore be helpful.
> >
> > -sebastian
> >
> > 2010/6/9 Jake Mannix <jake.mannix@gmail.com>
> >
> > > The code in Mahout CF is doing that?  I don't think that's right, we
> > > don't do anything that fancy right now, do we, Sean?
> > >
> > >  -jake
> > >
> > > On Tue, Jun 8, 2010 at 3:39 PM, Sebastian Schelter
> > > <ssc.open@googlemail.com>wrote:
> > >
> > > > Hi Kris,
> > > >
> > > > > actually, the code to compute the item-to-item similarities in the
> > > > > collaborative filtering part of Mahout (which at first glance seems
> > > > > to be a totally different problem from yours) is based on a paper
> > > > > that deals with computing the pairwise similarity of text documents
> > > > > in a very simple way. Maybe that could be helpful to you:
> > > >
> > > > Elsayed et al: Pairwise Document Similarity in Large Collections with
> > > > MapReduce
> > > >
> > > >
> > >
> >
> http://www.umiacs.umd.edu/~jimmylin/publications/Elsayed_etal_ACL2008_short.pdf
> > > > <
> > > >
> > >
> >
> http://www.umiacs.umd.edu/%7Ejimmylin/publications/Elsayed_etal_ACL2008_short.pdf
> > > > >
> > > >
> > > > -sebastian
> > > >
> > > >
> > > > 2010/6/8 Kris Jack <mrkrisjack@gmail.com>
> > > >
> > > > > Hi everyone,
> > > > >
> > > > > > I currently use Lucene's moreLikeThis function through Solr to find
> > > > > > documents that are related to one another. A single call, however,
> > > > > > takes around 4 seconds to complete, and I would like to reduce this.
> > > > > > I got to thinking that I might be able to use Mahout to generate a
> > > > > > document similarity matrix offline that could then be looked up in
> > > > > > real time for serving. Is this a reasonable use of Mahout? If so,
> > > > > > what functions will generate a document similarity matrix? Also, I
> > > > > > would like to keep the text processing advantages provided through
> > > > > > Lucene, so it would help if I could still use my Lucene index. If
> > > > > > not, could you recommend any alternative solutions please?
> > > > >
> > > > > Many thanks,
> > > > > Kris
> > > > >
> > > >
> > >
> >
>
