mahout-user mailing list archives

From Olivier Grisel <>
Subject Re: Generating a Document Similarity Matrix
Date Tue, 08 Jun 2010 22:56:39 GMT
2010/6/8 Jake Mannix <>:
> Hi Kris,
>  If you generate a full document-document similarity matrix offline, and
> then make sure to sparsify the rows (trim off all similarities below a
> threshold, or only take the top N for each row, etc...).  Then encoding
> these values directly in the index would indeed allow for *superfast*
> MoreLikeThis functionality, because you've already computed all
> of the similar results offline.

For 1e6 documents it might not be reasonable to generate the complete
document-document similarity matrix: 1e12 components => a couple of
terabytes of similarity values just to find the top N afterwards.
Sorting a terabyte of data can be fast when you have a datacenter like
Yahoo's or Google's, but might not be reasonable when you just have a
CMS running on a couple of servers :)

Trimming off low similarities should happen before starting to write
the rows to the hard drive.
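
A minimal Python sketch of that idea, assuming documents are held as
sparse {term: weight} dicts and cosine similarity is the measure (both
are assumptions for illustration, and the names cosine, trimmed_rows,
top_n and threshold are hypothetical, not part of Mahout). Each row is
trimmed in memory with a bounded heap, so only the top N entries above
the threshold ever reach the disk:

```python
import heapq
import math


def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as {term: weight} dicts."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the shorter vector
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    norm = norm_a * norm_b
    return dot / norm if norm else 0.0


def trimmed_rows(docs, top_n=10, threshold=0.1):
    """Yield (doc_id, [(similarity, other_id), ...]) one row at a time.

    A row is never materialized in full: low similarities are dropped
    as they are computed and heapq.nlargest keeps at most top_n entries.
    """
    for i, vec in docs.items():
        candidates = ((cosine(vec, other), j)
                      for j, other in docs.items() if j != i)
        above = (c for c in candidates if c[0] >= threshold)
        yield i, heapq.nlargest(top_n, above)


if __name__ == "__main__":
    docs = {
        "a": {"x": 1.0},
        "b": {"x": 1.0, "y": 1.0},
        "c": {"z": 1.0},
    }
    for doc_id, row in trimmed_rows(docs, top_n=2, threshold=0.1):
        print(doc_id, row)  # a real job would write the row to disk here
```

Note that this only bounds the storage to N entries per document. It
still performs all pairwise comparisons, so the computation itself
remains quadratic in the number of documents.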

Olivier -
