mahout-user mailing list archives

From Charles Earl <>
Subject Re: Document similarity
Date Fri, 11 Mar 2016 14:19:23 GMT
You might also look at this paper by Wallach

Sent from my iPhone

> On Mar 11, 2016, at 8:11 AM, David Starina <> wrote:
> Well, there is also an online method of LDA in Spark ... Pat, is there any
> documentation on the method you described?
>> On Wed, Feb 24, 2016 at 6:10 PM, Pat Ferrel <> wrote:
>> The method I described calculates similarity on the fly but requires new
>> docs to go through feature extraction before similarity can be queried. The
>> length of time to do feature extraction is short compared to training LDA.
>> Another method that gets at semantic similarity uses adaptive skip-grams
>> for text features. I haven’t tried this
>> but a friend saw a presentation about using this method to create features
>> for a search engine which showed a favorable comparison with word2vec.
>> If you want to use LDA note that it is an unsupervised categorization
>> method. To use it, the cluster descriptors (a vector of important terms)
>> can be compared to the analyzed incoming document using a KNN/search
>> engine. This will give you a list of the closest clusters but doesn’t
>> really give you documents, which is your goal I think. LDA should be re-run
>> periodically to generate new clusters. Do you want to know cluster
>> inclusion or get a list of similar docs?
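A toy illustration of Pat's cluster-comparison idea above (not his actual pipeline): each LDA cluster descriptor is a vector of weighted terms, and an incoming document's analyzed term weights are compared with plain cosine rather than a real KNN/search engine. All names and weights here are hypothetical.

```python
import numpy as np

vocab = ["spark", "lda", "search", "topic"]

def to_vec(weights):
    # Turn a {term: weight} descriptor into a dense vector over the vocab.
    return np.array([weights.get(t, 0.0) for t in vocab])

# Hypothetical cluster descriptors (important terms per LDA cluster).
clusters = {
    "ml":     to_vec({"lda": 0.9, "topic": 0.8}),
    "search": to_vec({"search": 0.9, "spark": 0.3}),
}

# An analyzed incoming document, as term weights.
doc = to_vec({"topic": 1.0, "lda": 0.5})

def closest(doc_vec, clusters):
    # Return the cluster whose descriptor is most cosine-similar to the doc.
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max(clusters, key=lambda c: cos(doc_vec, clusters[c]))

print(closest(doc, clusters))  # → "ml"
```

As Pat notes, this only ranks clusters, not documents; a search engine index over the same features is what returns actual similar docs.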
>> On Feb 23, 2016, at 1:01 PM, David Starina <>
>> wrote:
>> Guys, one more question ... Are there some incremental methods to do this?
>> I don't want to run the whole job again once a new document is added. In
>> case of LDA ... I guess the best way is to calculate the topics on the new
>> document using the topics from the previous LDA run ... And then every once
>> in a while to recalculate the topics with the new documents?
>> On Sun, Feb 14, 2016 at 10:02 PM, Pat Ferrel <>
>> wrote:
>>> Something we are working on for purely content based similarity is using a
>>> KNN engine (search engine) but creating features from word2vec and an NER
>>> (Named Entity Recognizer).
>>> Putting the generated features into fields of a doc can really help with
>>> similarity because w2v and NER create semantic features. You can also try
>>> n-grams or skip-grams. These features are not very helpful for search but
>>> for similarity they work well.
>>> The query to the KNN engine is a document, each field mapped to the
>>> corresponding field of the index. The result is the k nearest neighbors to
>>> the query doc.
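A minimal sketch of the document-as-query idea above, under simplifying assumptions: random stand-in word vectors instead of a trained word2vec model, one field per document, and brute-force cosine instead of a real KNN/search engine.

```python
import numpy as np

def doc_vector(tokens, w2v):
    # Represent a document as the mean of its word vectors.
    vecs = [w2v[t] for t in tokens if t in w2v]
    return np.mean(vecs, axis=0)

def knn(query_vec, index, k=2):
    # Rank indexed docs by cosine similarity to the query document.
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    scored = sorted(index.items(), key=lambda kv: -cos(query_vec, kv[1]))
    return [doc_id for doc_id, _ in scored[:k]]

# Hypothetical "word2vec" vectors; a real model would be trained on a corpus.
rng = np.random.default_rng(0)
w2v = {w: rng.normal(size=8) for w in ["spark", "lda", "topic", "search"]}

index = {
    "doc1": doc_vector(["spark", "lda"], w2v),
    "doc2": doc_vector(["topic", "lda"], w2v),
    "doc3": doc_vector(["search"], w2v),
}

# "doc2" ranks first: it has exactly the query's tokens.
print(knn(doc_vector(["lda", "topic"], w2v), index))
```

In the setup Pat describes, each feature type (w2v, NER, n-grams) would live in its own index field and the query doc's fields would be matched field-to-field.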
>>>>> On Feb 14, 2016, at 11:05 AM, David Starina <>
>>>> wrote:
>>>> Charles, thank you, I will check that out.
>>>> Ted, I am looking for semantic similarity. Unfortunately, I do not have any
>>>> data on the usage of the documents (if by usage you mean user behavior).
>>>>> On Sun, Feb 14, 2016 at 4:04 PM, Ted Dunning <>
>>>> wrote:
>>>>> Did you want textual similarity?
>>>>> Or semantic similarity?
>>>>> The actual semantics of a message can be opaque from the content, but clear
>>>>> from the usage.
>>>>> On Sun, Feb 14, 2016 at 5:29 AM, Charles Earl <>
>>> wrote:
>>>>>> David,
>>>>>> LDA or LSI can work quite nicely for similarity (YMMV of course depending
>>>>>> on the characterization of your documents).
>>>>>> You basically use the dot product of the square roots of the vectors for
>>>>>> LDA -- if you do a search for Hellinger or Bhattacharyya distance that
>>>>>> will lead you to a good similarity or distance measure.
>>>>>> As I recall, Spark does provide an LDA implementation. Gensim provides an
>>>>>> API for doing LDA similarity out of the box. Vowpal Wabbit is also worth
>>>>>> looking at, particularly for a large dataset.
>>>>>> Hope this is useful.
>>>>>> Cheers
>>>>>> Sent from my iPhone
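Charles's "dot product of the square roots" suggestion above is the Hellinger affinity between two topic-probability vectors. A minimal sketch with made-up topic distributions (any real ones would come from an LDA model):

```python
import numpy as np

def hellinger_similarity(p, q):
    # Hellinger affinity: dot product of the element-wise square roots
    # of two probability vectors. 1.0 means identical distributions.
    return float(np.sqrt(p) @ np.sqrt(q))

# Hypothetical per-document topic distributions (must each sum to 1).
doc_a = np.array([0.7, 0.2, 0.1])
doc_b = np.array([0.6, 0.3, 0.1])

print(hellinger_similarity(doc_a, doc_b))  # ≈ 0.993, i.e. very similar
```

The corresponding Hellinger distance is sqrt(1 - affinity), so either form can back a "most similar documents" ranking.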
>>>>>>>> On Feb 14, 2016, at 8:14 AM, David Starina <>
>>>>>>> wrote:
>>>>>>> Hi,
>>>>>>> I need to build a system to determine the N (i.e. 10) most similar
>>>>>>> documents to a given document. I have some (theoretical) knowledge of
>>>>>>> Mahout algorithms, but not enough to build the system. Can you give me
>>>>>>> some suggestions?
>>>>>>> At first I was researching Latent Semantic Analysis for the task, but
>>>>>>> since Mahout doesn't support it, I started researching some other
>>>>>>> options. I got a hint that instead of LSA, you can use LDA (Latent
>>>>>>> Dirichlet allocation) in Mahout to achieve similar and even better
>>>>>>> results.
>>>>>>> However ... and this is where I got confused ... LDA is a clustering
>>>>>>> algorithm. However, what I need is not to cluster the documents into N
>>>>>>> clusters - I need to get a matrix (similar to TF-IDF) from which I can
>>>>>>> calculate some sort of a distance for any two documents to get the N
>>>>>>> most similar documents for any given document.
>>>>>>> How do I achieve that? My idea was (still mostly theoretical, since I
>>>>>>> have some problems with running the LDA algorithm) to extract some
>>>>>>> number of topics with LDA, but I need not cluster the documents with
>>>>>>> the help of these topics, but to get the matrix with documents as one
>>>>>>> dimension and topics as the other dimension. I was guessing I could
>>>>>>> then use this matrix as an input to a row-similarity algorithm.
>>>>>>> Is this the correct concept? Or am I missing something?
>>>>>>> And, since LDA is not supported on Spark/Samsara, how could I achieve
>>>>>>> similar results on Spark?
>>>>>>> Thanks in advance,
>>>>>>> David
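The document-by-topic matrix David describes above can be sketched directly: rows are documents, columns are topic weights, and "row similarity" is then plain cosine between rows. The matrix values here are made up; a real one would come from an LDA run.

```python
import numpy as np

# Hypothetical document-topic matrix: one row per document,
# one column per LDA topic weight.
doc_topics = np.array([
    [0.8, 0.1, 0.1],   # doc 0
    [0.7, 0.2, 0.1],   # doc 1
    [0.1, 0.1, 0.8],   # doc 2
])

def most_similar(row, matrix, n=1):
    # Cosine similarity of one row against all rows, self excluded.
    norms = np.linalg.norm(matrix, axis=1)
    sims = matrix @ matrix[row] / (norms * norms[row])
    sims[row] = -1.0  # exclude the query document itself
    return [int(i) for i in np.argsort(-sims)[:n]]

print(most_similar(0, doc_topics))  # → [1]: doc 1 is closest to doc 0
```

This is the brute-force version of what a row-similarity job or a KNN/search engine does at scale.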
