mahout-user mailing list archives

From Sean Owen <sro...@gmail.com>
Subject Re: Generating a Document Similarity Matrix
Date Tue, 08 Jun 2010 23:08:52 GMT
Sort of. There is a separate job to compute all item-item similarities
under a variety of metrics. This is what Sebastian wrote. It's not
used in the co-occurrence recommender (but could be -- vaguely a to-do
here.)

But sure, if you're willing to think of a doc as an "item vector" of
"preferences" from "words" then this works fine to compute doc
similarity under these metrics.
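
To make the "doc as item vector" idea concrete, here is a minimal single-machine sketch in Python. It is not Mahout's job or its API; it only illustrates the inverted-index style of pairwise similarity that the Elsayed et al. paper cited below in this thread describes, using raw term counts as the "preferences" and cosine as the metric. The function name `pairwise_cosine` and the input shape (doc id mapped to a token list) are assumptions made for illustration.

```python
from collections import Counter, defaultdict
from itertools import combinations
from math import sqrt


def pairwise_cosine(docs):
    """docs: dict of doc id -> list of tokens. Returns {(a, b): cosine} with a < b.

    Hypothetical helper for illustration only; not part of Mahout.
    """
    # Term-count vector per document: the "item vector" of word "preferences".
    vectors = {doc_id: Counter(tokens) for doc_id, tokens in docs.items()}
    norms = {doc_id: sqrt(sum(w * w for w in vec.values()))
             for doc_id, vec in vectors.items()}

    # Inverted index: term -> list of (doc id, weight).
    postings = defaultdict(list)
    for doc_id, vec in vectors.items():
        for term, weight in vec.items():
            postings[term].append((doc_id, weight))

    # Each term adds a partial dot product to every pair of docs that share it.
    dots = defaultdict(float)
    for plist in postings.values():
        for (a, wa), (b, wb) in combinations(sorted(plist), 2):
            dots[(a, b)] += wa * wb

    # Normalize the accumulated dot products to get cosine similarity.
    return {(a, b): dot / (norms[a] * norms[b]) for (a, b), dot in dots.items()}


if __name__ == "__main__":
    sims = pairwise_cosine({
        "d1": ["mahout", "similarity", "matrix"],
        "d2": ["mahout", "similarity", "lucene"],
        "d3": ["solr"],
    })
    for pair, score in sorted(sims.items()):
        print(pair, round(score, 3))
```

Only pairs that share at least one term ever appear in the result, which is why this formulation avoids comparing every document against every other one. Very common terms produce long posting lists and dominate the cost, so in practice one would likely prune or down-weight them first.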

On Wed, Jun 9, 2010 at 12:52 AM, Jake Mannix <jake.mannix@gmail.com> wrote:
> The code in mahout CF is doing that?  I don't think that's right, we don't
> do anything that fancy right now, do we Sean?
>
>  -jake
>
> On Tue, Jun 8, 2010 at 3:39 PM, Sebastian Schelter
> <ssc.open@googlemail.com> wrote:
>
>> Hi Kris,
>>
>> actually the code to compute the item-to-item similarities in the
>> collaborative filtering part of mahout (which at first glance seems to be
>> a totally different problem from yours) is based on a paper that deals with
>> computing the pairwise similarity of text documents in a very simple way.
>> Maybe that could be helpful to you:
>>
>> Elsayed et al: Pairwise Document Similarity in Large Collections with
>> MapReduce
>>
>> http://www.umiacs.umd.edu/~jimmylin/publications/Elsayed_etal_ACL2008_short.pdf
>>
>> -sebastian
>>
>>
>> 2010/6/8 Kris Jack <mrkrisjack@gmail.com>
>>
>> > Hi everyone,
>> >
>> > I currently use lucene's moreLikeThis function through solr to find
>> > documents that are related to one another.  A single call, however, takes
>> > around 4 seconds to complete and I would like to reduce this.  I got to
>> > thinking that I might be able to use Mahout to generate a document
>> > similarity matrix offline that could then be looked-up in real time for
>> > serving.  Is this a reasonable use of Mahout?  If so, what functions will
>> > generate a document similarity matrix?  Also, I would like to be able to
>> > keep the text processing advantages provided through lucene so it would
>> > help if I could still use my lucene index.  If not, then could you
>> > recommend any alternative solutions please?
>> >
>> > Many thanks,
>> > Kris
>> >
>>
>
