Hi,
I need to build a system that determines the N (e.g. 10) most similar documents to
a given document. I have some (theoretical) knowledge of Mahout algorithms,
but not enough to build the system. Can you give me some suggestions?
At first I was researching Latent Semantic Analysis for the task, but since
Mahout doesn't support it, I started looking at other options. I got
a hint that instead of LSA, you can use LDA (Latent Dirichlet Allocation)
in Mahout to achieve similar or even better results.
Here is where I got confused: LDA is a clustering algorithm, but what I
need is not to cluster the documents into N clusters. I need a matrix
(similar to TF-IDF) from which I can calculate some sort of distance
between any two documents, so that I can find the N most similar
documents for any given document.
How do I achieve that? My idea (still mostly theoretical, since I have
some problems with running the LDA algorithm) was to extract some number of
topics with LDA, but not to cluster the documents using these topics.
Instead, I want to get a matrix with documents as one dimension and topics as
the other dimension. I was guessing I could then use this matrix as the
input to the row similarity algorithm (RowSimilarityJob).
Is this the correct concept? Or am I missing something?
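To make the concept concrete, here is a minimal sketch (in plain Python, not Mahout) of the second half of the pipeline I have in mind: given a document-topic matrix produced by LDA, compute pairwise cosine similarity between rows and return the N most similar documents. The matrix values below are made up for illustration; in practice they would come from the LDA step.

```python
import math

# Toy document-topic matrix: rows = documents, columns = LDA topic weights.
# (Hypothetical values; in a real system these come from the LDA output.)
doc_topics = [
    [0.8, 0.1, 0.1],  # doc 0
    [0.7, 0.2, 0.1],  # doc 1
    [0.1, 0.1, 0.8],  # doc 2
    [0.2, 0.1, 0.7],  # doc 3
]

def cosine(a, b):
    """Cosine similarity between two topic-weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def most_similar(doc_id, n):
    """Return the ids of the n documents most similar to doc_id (excluding itself)."""
    scores = [(other, cosine(doc_topics[doc_id], doc_topics[other]))
              for other in range(len(doc_topics)) if other != doc_id]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in scores[:n]]

print(most_similar(0, 2))  # -> [1, 3]: doc 1 shares doc 0's topic profile most closely
```

This is essentially what I hope RowSimilarityJob would do at scale over the LDA output.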
And, since LDA is not supported on Spark/Samsara, how could I achieve
similar results on Spark?
Thanks in advance,
David
