mahout-user mailing list archives

From Jake Mannix <jake.man...@gmail.com>
Subject Re: Generating a Document Similarity Matrix
Date Tue, 08 Jun 2010 23:16:16 GMT
On Tue, Jun 8, 2010 at 3:56 PM, Olivier Grisel <olivier.grisel@ensta.org>wrote:

> 2010/6/8 Jake Mannix <jake.mannix@gmail.com>:
> > Hi Kris,
> >
> >  If you generate a full document-document similarity matrix offline
> > and then sparsify the rows (trim off all similarities below a
> > threshold, or only take the top N for each row, etc.), then encoding
> > these values directly in the index would indeed allow for *superfast*
> > MoreLikeThis functionality, because you've already computed all
> > of the similar results offline.
>
> For 1e6 documents it might not be reasonable to generate the complete
> document-document similarity matrix: 1e12 components => a couple of
> terabytes of similarity values just to find the top N afterwards:


Nope, that's not what happens in the approach I described: when you take
sparseDocumentMatrix.transpose().times(itself), the scaling does
not go as N^2*M with N^2 outputs - the calculation is sparse,
computing only the entries which are nonzero.  If you pre-sparsify the
documents a little (remove all terms which occur in more than
1% of all documents, or something like that), this sparse calculation
is even faster: it scales as sum_{i=1...N}(k_i)^2, where k_i is the
number of nonzero elements in document i.  If all documents were
the same length (k), then this scales as N*k^2, and the total
number of nonzero entries in the output is far less than N^2 when
k << N,M.
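For the record, that sparse self-product pattern can be sketched in a few
lines of plain Java.  This is a toy in-memory version, not Mahout's
distributed implementation; the class and variable names are made up for
illustration:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Toy sketch (plain Java, not Mahout's API) of the sparse self-product:
 * C = A^T * A accumulated as a sum of sparse outer products, one per row.
 * The work is sum_{i}(k_i)^2, where k_i is the number of nonzeros in
 * row i -- not N^2*M -- and only nonzero entries of C are materialized.
 */
public class SparseSelfProduct {

    /** Each row is a sparse vector: column index -> value. */
    static Map<Long, Double> selfProduct(List<Map<Integer, Double>> a) {
        Map<Long, Double> c = new HashMap<>();  // keyed by packed (col1, col2)
        for (Map<Integer, Double> row : a) {
            // Outer product of this sparse row with itself: k_i^2 updates.
            for (Map.Entry<Integer, Double> e1 : row.entrySet()) {
                for (Map.Entry<Integer, Double> e2 : row.entrySet()) {
                    long key = ((long) e1.getKey() << 32)
                             | (e2.getKey() & 0xffffffffL);
                    c.merge(key, e1.getValue() * e2.getValue(), Double::sum);
                }
            }
        }
        return c;
    }

    public static void main(String[] args) {
        // Two tiny sparse rows over a 4-term vocabulary.
        Map<Integer, Double> d0 = new HashMap<>();
        d0.put(0, 1.0);
        d0.put(2, 2.0);
        Map<Integer, Double> d1 = new HashMap<>();
        d1.put(2, 1.0);
        d1.put(3, 1.0);
        List<Map<Integer, Double>> rows = new ArrayList<>();
        rows.add(d0);
        rows.add(d1);

        Map<Long, Double> c = selfProduct(rows);
        // Entry (2,2) = 2.0*2.0 + 1.0*1.0 = 5.0; column pairs that never
        // co-occur in any row (e.g. (0,3)) are never even computed.
        System.out.println(c.get((2L << 32) | 2L));
        System.out.println(c.size());
    }
}
```

Depending on whether documents are the rows or the columns of the input,
the same loop yields the document-document or the term-term Gram matrix;
the scaling argument is identical either way.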

Getting rid of the common terms (even *lots* of them) beforehand
is still a very good idea.
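As a concrete illustration of that pruning step, here is a minimal
sketch (hypothetical names and document frequencies; the 1% cutoff is
just the figure mentioned above) that drops high-document-frequency
terms before the similarity computation:

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch of pre-sparsification: drop any term occurring in more than
 *  1% of all documents before computing the similarity product, since
 *  such terms dominate the cost of the sparse self-product. */
public class CommonTermFilter {
    public static void main(String[] args) {
        int numDocs = 1000;
        // Hypothetical document frequencies for two terms.
        Map<String, Integer> docFreq = new HashMap<>();
        docFreq.put("the", 950);    // near-stopword, would dominate the cost
        docFreq.put("mahout", 8);   // rare, discriminative term
        int maxDf = (int) (0.01 * numDocs);  // the 1% threshold
        Map<String, Integer> kept = new HashMap<>();
        for (Map.Entry<String, Integer> e : docFreq.entrySet()) {
            if (e.getValue() <= maxDf) {
                kept.put(e.getKey(), e.getValue());
            }
        }
        System.out.println(kept.keySet());  // only the rare term survives
    }
}
```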

   -jake
