On Tue, Jun 8, 2010 at 3:56 PM, Olivier Grisel <olivier.grisel@ensta.org> wrote:
> 2010/6/8 Jake Mannix <jake.mannix@gmail.com>:
> > Hi Kris,
> >
> > If you generate a full document-document similarity matrix offline,
> > and then make sure to sparsify the rows (trim off all similarities
> > below a threshold, or only take the top N for each row, etc.), then
> > encoding these values directly in the index would indeed allow for
> > *superfast* MoreLikeThis functionality, because you've already
> > computed all of the similar results offline.
>
> For 1e6 documents it might not be reasonable to generate the complete
> document-document similarity matrix: 1e12 components => a couple of
> terabytes of similarity values just to find the top N afterwards.
Nope, this isn't what happens in what I described: when you take
a sparseDocumentMatrix.transpose().times(itself), the cost does not
scale as N^2 * M with N^2 outputs; the calculation is sparse, only
computing the entries which are nonzero. If you pre-sparsify the
documents a little (remove all terms which occur in more than
1% of all documents, or something like that), this sparse calculation
is even faster: it scales as sum_{i=1..N} (k_i)^2, where k_i is the
number of nonzero elements in document i. If all documents were
the same length k, then this scales as N*k^2, and the total
number of nonzero entries in the output is far less than N^2 when
k << N, M.
Getting rid of the common terms (even *lots* of them) beforehand
is still a very good idea.
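To make the scaling argument concrete, here's a minimal sketch using SciPy sparse matrices (not Mahout code; the sizes n_docs, n_terms, and k are made-up illustration values): multiplying a sparse document matrix by its transpose only ever computes entries for document pairs that share at least one term, so the output stays far smaller than the dense N^2 matrix.

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)
n_docs, n_terms, k = 1000, 5000, 20  # hypothetical: ~k nonzero terms per doc

# Build a random sparse document-term matrix (rows = documents).
rows = np.repeat(np.arange(n_docs), k)
cols = rng.integers(0, n_terms, size=n_docs * k)
vals = rng.random(n_docs * k)
A = sparse.csr_matrix((vals, (rows, cols)), shape=(n_docs, n_terms))

# Sparse document-document similarity: the multiplication only touches
# document pairs with overlapping terms, so both the work done and the
# storage used are far below the dense N^2 bound when k << N, M.
S = A @ A.T

density = S.nnz / (n_docs * n_docs)
print(f"nonzero similarity entries: {S.nnz} of {n_docs**2} ({density:.1%})")
```

Trimming each row of S to its top N values afterwards (as described above) is what you'd then bake into the index for the fast MoreLikeThis lookup.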
jake
