mahout-user mailing list archives

From Olivier Grisel <olivier.gri...@ensta.org>
Subject Re: Generating a Document Similarity Matrix
Date Tue, 08 Jun 2010 22:56:39 GMT
2010/6/8 Jake Mannix <jake.mannix@gmail.com>:
> Hi Kris,
>
>  If you generate a full document-document similarity matrix offline, and
> then make sure to sparsify the rows (trim off all similarities below a
> threshold, or only take the top N for each row, etc...), then encoding
> these values directly in the index would indeed allow for *superfast*
> MoreLikeThis functionality, because you've already computed all
> of the similar results offline.

For 1e6 documents it might not be reasonable to generate the complete
document-document similarity matrix: 1e12 components => a couple of
terabytes of similarity values just to find the top N afterwards.
Sorting a terabyte of data can be fast when you have a datacenter like
Yahoo's or Google's, but might not be reasonable when you just have a
CMS running on a couple of servers :)
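
As a rough check of that estimate, here is a back-of-the-envelope
sketch. It assumes a dense matrix with 4 bytes per similarity value
(a single-precision float); that storage size is my assumption, and
the function name is only illustrative.

```python
def dense_matrix_bytes(n_docs, bytes_per_value=4):
    """Size in bytes of a dense n_docs x n_docs similarity matrix.

    bytes_per_value=4 assumes single-precision floats (an assumption).
    """
    return n_docs * n_docs * bytes_per_value


size = dense_matrix_bytes(10**6)
print(size / 1e12, "TB")  # prints: 4.0 TB
```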

Trimming off low similarities should happen before starting to write
the rows to the hard drive.
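
A minimal sketch of what I mean, in plain Python rather than anything
Mahout-specific: each row is consumed as a stream of (doc_id, score)
pairs and only the top N above a threshold are kept in memory, so the
full row never has to be written out. The function name, the parameter
names and the sample scores are all hypothetical.

```python
import heapq


def trim_row(similarities, top_n=10, threshold=0.0):
    """Keep at most top_n (doc_id, score) pairs with score >= threshold.

    `similarities` can be any iterable of (doc_id, score) pairs, so the
    scores can be computed lazily and discarded as soon as they are
    known to fall outside the top N.
    """
    kept = (pair for pair in similarities if pair[1] >= threshold)
    return heapq.nlargest(top_n, kept, key=lambda pair: pair[1])


# Made-up scores for one document's row.
row = [(0, 0.91), (1, 0.05), (2, 0.40), (3, 0.77)]
print(trim_row(row, top_n=2, threshold=0.1))  # [(0, 0.91), (3, 0.77)]
```

Only the trimmed list returned for each row would then be written to
disk or encoded in the index.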

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
