mahout-user mailing list archives

From Pat Ferrel <>
Subject Re: Document similarity
Date Sun, 14 Feb 2016 21:02:34 GMT
Something we are working on for purely content-based similarity is using a KNN engine (search
engine) but creating features from word2vec and an NER (Named Entity Recognizer).

Putting the generated features into fields of a doc can really help with similarity, because
w2v and NER create semantic features. You can also try n-grams or skip-grams. These features
are not very helpful for search, but for similarity they work well.

The query to the KNN engine is a document, each field mapped to the corresponding field of
the index. The result is the k nearest neighbors to the query doc.
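As a rough sketch of the query pattern (not Mahout or search-engine code; the vectors below are toy stand-ins for averaged word2vec/NER features, and plain cosine similarity stands in for the engine's scoring):

```python
import math

# Toy document vectors standing in for word2vec / NER features.
# In practice these would come from a trained embedding model.
index = {
    "doc1": [0.9, 0.1, 0.0],
    "doc2": [0.8, 0.2, 0.1],
    "doc3": [0.0, 0.9, 0.4],
}

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def knn(query_vec, index, k=2):
    """Return the ids of the k indexed docs nearest to the query doc."""
    scored = sorted(index.items(),
                    key=lambda kv: cosine(query_vec, kv[1]),
                    reverse=True)
    return [doc_id for doc_id, _ in scored[:k]]

print(knn([0.85, 0.15, 0.05], index))  # → ['doc1', 'doc2']
```

A real KNN/search engine would index each feature type in its own field and score the query document field-by-field, but the shape of the operation is the same: the query is itself a document vectorized the same way as the index.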

> On Feb 14, 2016, at 11:05 AM, David Starina <> wrote:
> Charles, thank you, I will check that out.
> Ted, I am looking for semantic similarity. Unfortunately, I do not have any
> data on the usage of the documents (if by usage you mean user behavior).
> On Sun, Feb 14, 2016 at 4:04 PM, Ted Dunning <> wrote:
>> Did you want textual similarity?
>> Or semantic similarity?
>> The actual semantics of a message can be opaque from the content, but clear
>> from the usage.
>> On Sun, Feb 14, 2016 at 5:29 AM, Charles Earl <> wrote:
>>> David,
>>> LDA or LSI can work quite nicely for similarity (YMMV of course depending
>>> on the characterization of your documents).
>>> You basically use the dot product of the square roots of the vectors for
>>> LDA -- if you do a search for Hellinger or Bhattacharyya distance, that
>>> will lead you to a good similarity or distance measure.
>>> As I recall, Spark does provide an LDA implementation. Gensim provides an
>>> API for doing LDA similarity out of the box. Vowpal Wabbit is also worth
>>> looking at, particularly for a large dataset.
>>> Hope this is useful.
>>> Cheers
>>> Sent from my iPhone
>>>> On Feb 14, 2016, at 8:14 AM, David Starina <> wrote:
>>>> Hi,
>>>> I need to build a system to determine the N (e.g. 10) most similar
>>>> documents to a given document. I have some (theoretical) knowledge of
>>>> Mahout algorithms, but not enough to build the system. Can you give me
>>>> some suggestions?
>>>> At first I was researching Latent Semantic Analysis for the task, but
>>>> since Mahout doesn't support it, I started researching some other
>>>> options. I got a hint that instead of LSA, you can use LDA (Latent
>>>> Dirichlet Allocation) in Mahout to achieve similar and even better
>>>> results.
>>>> However (and this is where I got confused), LDA is a clustering
>>>> algorithm. What I need is not to cluster the documents into N clusters;
>>>> I need a matrix (similar to TF-IDF) from which I can calculate some sort
>>>> of distance between any two documents, so I can find the N most similar
>>>> documents for any given document.
>>>> How do I achieve that? My idea (still mostly theoretical, since I have
>>>> some problems with running the LDA algorithm) was to extract some number
>>>> of topics with LDA, but not to cluster the documents by these topics;
>>>> instead, to get a matrix with documents as one dimension and topics as
>>>> the other. I was guessing I could then use this matrix as an input to a
>>>> row-similarity algorithm.
>>>> Is this the correct concept? Or am I missing something?
>>>> And, since LDA is not supported on Spark/Samsara, how could I achieve
>>>> similar results on Spark?
>>>> Thanks in advance,
>>>> David
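David's matrix idea maps directly onto the row-similarity step he mentions. A minimal sketch in plain Python (toy doc-topic weights from a hypothetical 3-topic LDA run; at scale this is what a row-similarity job such as Mahout's would compute over the whole matrix):

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def top_n_similar(doc_topic, doc_id, n=10):
    """Rank all other rows of a doc x topic matrix by similarity to doc_id."""
    query = doc_topic[doc_id]
    others = ((other, cosine(query, row))
              for other, row in doc_topic.items() if other != doc_id)
    return sorted(others, key=lambda t: t[1], reverse=True)[:n]

# Toy doc x topic matrix; each row is one document's topic weights.
doc_topic = {
    "a": [0.8, 0.1, 0.1],
    "b": [0.7, 0.2, 0.1],
    "c": [0.1, 0.1, 0.8],
}

print(top_n_similar(doc_topic, "a", n=2))  # "b" ranks first
```

If the rows are proper probability distributions, Hellinger distance (as suggested earlier in the thread) can be swapped in for cosine; the matrix-then-row-similarity structure stays the same.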
