mahout-user mailing list archives

From Charles Earl <>
Subject Re: Document similarity
Date Sun, 14 Feb 2016 13:29:00 GMT
LDA or LSI can work quite nicely for similarity (YMMV of course depending on the characterization
of your documents).
You basically use the dot product of the element-wise square roots of the topic vectors
for LDA -- if you do a search for Hellinger or Bhattacharyya distance that will lead you
to a good similarity or distance measure.
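
To make that concrete, here is a minimal sketch in plain Python. It assumes each document is already represented as a topic distribution (non-negative values summing to 1), for example the per-document output of an LDA model. The function names and the small `doc_topics` matrix are hypothetical and only for illustration; they do not come from Mahout, Spark, or Gensim.

```python
import math

def bhattacharyya_coefficient(p, q):
    """Dot product of the element-wise square roots of two distributions.

    Equals 1.0 for identical distributions and 0.0 when they share no topic.
    """
    return sum(math.sqrt(a * b) for a, b in zip(p, q))

def hellinger_distance(p, q):
    """Hellinger distance, sqrt(1 - BC(p, q)), in the range [0, 1]."""
    # max() guards against a tiny negative value caused by rounding.
    return math.sqrt(max(0.0, 1.0 - bhattacharyya_coefficient(p, q)))

def top_n_similar(query_index, doc_topics, n=10):
    """Return the indices of the n documents closest to the query document."""
    query = doc_topics[query_index]
    scored = [
        (hellinger_distance(query, row), i)
        for i, row in enumerate(doc_topics)
        if i != query_index  # skip the query document itself
    ]
    scored.sort()  # smallest distance first
    return [i for _, i in scored[:n]]

# Hypothetical document-topic matrix: one row per document, one column per
# topic, each row summing to 1.
doc_topics = [
    [0.70, 0.20, 0.10],
    [0.65, 0.25, 0.10],
    [0.10, 0.10, 0.80],
    [0.05, 0.15, 0.80],
]

nearest = top_n_similar(0, doc_topics, n=2)
```

This brute-force scan compares the query against every document, which is fine for a small collection; a large one would likely need a distributed or approximate nearest-neighbour approach instead.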
As I recall, Spark does provide an LDA implementation. Gensim provides an API for doing LDA
similarity out of the box. Vowpal Wabbit is also worth looking at, particularly for a large
Hope this is useful.


> On Feb 14, 2016, at 8:14 AM, David Starina <> wrote:
> Hi,
> I need to build a system to determine N (e.g. 10) most similar documents to
> a given document. I have some (theoretical) knowledge of Mahout algorithms,
> but not enough to build the system. Can you give me some suggestions?
> At first I was researching Latent Semantic Analysis for the task, but since
> Mahout doesn't support it, I started researching some other options. I got
> a hint that instead of LSA, you can use LDA (Latent Dirichlet allocation)
> in Mahout to achieve similar and even better results.
> However ... and this is where I got confused ... LDA is a clustering
> algorithm. However, what I need is not to cluster the documents into N
> clusters - I need to get a matrix (similar to TF-IDF) from which I can
> calculate some sort of a distance for any two documents to get N most
> similar documents for any given document.
> How do I achieve that? My idea was (still mostly theoretical, since I have
> some problems with running the LDA algorithm) to extract some number of
> topics with LDA, not to cluster the documents with the help of these
> topics, but to get a matrix with documents as one dimension and topics as
> the other dimension. I was guessing I could then use this matrix as an
> input to a row-similarity algorithm.
> Is this the correct concept? Or am I missing something?
> And, since LDA is not supported on Spark/Samsara, how could I achieve
> similar results on Spark?
> Thanks in advance,
> David
