mahout-user mailing list archives

From Charles Earl <charlesce...@me.com>
Subject Re: Document similarity
Date Fri, 11 Mar 2016 14:19:23 GMT
You might also look at this paper by Wallach
http://maroo.cs.umass.edu/pub/web/getpdf.php?id=1101

Sent from my iPhone

> On Mar 11, 2016, at 8:11 AM, David Starina <david.starina@gmail.com> wrote:
> 
> Well, there is also an online method of LDA in Spark ... Pat, is there any
> documentation on the method you described?
> 
>> On Wed, Feb 24, 2016 at 6:10 PM, Pat Ferrel <pat@occamsmachete.com> wrote:
>> 
>> The method I described calculates similarity on the fly but requires new
>> docs to go through feature extraction before similarity can be queried. The
>> length of time to do feature extraction is short compared to training LDA.
>> 
>> Another method that gets at semantic similarity uses adaptive skip-grams
>> for text features: http://arxiv.org/abs/1502.07257. I haven’t tried this,
>> but a friend saw a presentation about using this method to create features
>> for a search engine, which showed a favorable comparison with word2vec.
>> 
>> If you want to use LDA, note that it is an unsupervised categorization
>> method. To use it, the cluster descriptors (a vector of important terms)
>> can be compared to the analyzed incoming document using a KNN/search
>> engine. This will give you a list of the closest clusters but doesn’t
>> really give you documents, which I think is your goal. LDA should be re-run
>> periodically to generate new clusters. Do you want to know cluster
>> inclusion or get a list of similar docs?
>> 
>> On Feb 23, 2016, at 1:01 PM, David Starina <david.starina@gmail.com>
>> wrote:
>> 
>> Guys, one more question ... Are there some incremental methods to do this?
>> I don't want to run the whole job again once a new document is added. In
>> case of LDA ... I guess the best way is to calculate the topics on the new
>> document using the topics from the previous LDA run ... And then every once
>> in a while to recalculate the topics with the new documents?
>> 
>> On Sun, Feb 14, 2016 at 10:02 PM, Pat Ferrel <pat@occamsmachete.com>
>> wrote:
>> 
>>> Something we are working on for purely content based similarity is using
>>> a KNN engine (search engine) but creating features from word2vec and an
>>> NER (Named Entity Recognizer).
>>> 
>>> Putting the generated features into fields of a doc can really help with
>>> similarity because w2v and NER create semantic features. You can also try
>>> n-grams or skip-grams. These features are not very helpful for search, but
>>> for similarity they work well.
>>> 
>>> The query to the KNN engine is a document, each field mapped to the
>>> corresponding field of the index. The result is the k nearest neighbors
>>> to the query doc.
>>> 
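For the n-gram and skip-gram features mentioned above, here is a rough sketch of one way to generate skip-bigram features in plain Python. It assumes the text is already tokenized; `skip_bigrams` and `max_skip` are illustrative names, not part of Mahout or any search engine API, and this is simple pair extraction rather than the adaptive skip-gram model.

```python
def skip_bigrams(tokens, max_skip=2):
    """Return ordered token pairs separated by at most max_skip tokens."""
    features = []
    for i, left in enumerate(tokens):
        # j covers the next (max_skip + 1) positions after i, bounded by the end.
        for j in range(i + 1, min(i + 2 + max_skip, len(tokens))):
            features.append(f"{left}_{tokens[j]}")
    return features

# Hypothetical example sentence; max_skip=0 would give ordinary bigrams.
tokens = "latent dirichlet allocation finds topics".split()
print(skip_bigrams(tokens, max_skip=1))
```

The resulting strings could be stored in their own field of the indexed document, next to the word2vec and NER fields, so the KNN query can match on them.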
>>> 
>>>> On Feb 14, 2016, at 11:05 AM, David Starina <david.starina@gmail.com> wrote:
>>>> 
>>>> Charles, thank you, I will check that out.
>>>> 
>>>> Ted, I am looking for semantic similarity. Unfortunately, I do not have
>>>> any data on the usage of the documents (if by usage you mean user
>>>> behavior).
>>>> 
>>>>> On Sun, Feb 14, 2016 at 4:04 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
>>>> 
>>>>> Did you want textual similarity?
>>>>> 
>>>>> Or semantic similarity?
>>>>> 
>>>>> The actual semantics of a message can be opaque from the content, but
>>>>> clear from the usage.
>>>>> 
>>>>> 
>>>>> 
>>>>> On Sun, Feb 14, 2016 at 5:29 AM, Charles Earl <charlescearl@me.com> wrote:
>>>>> 
>>>>>> David,
>>>>>> LDA or LSI can work quite nicely for similarity (YMMV of course,
>>>>>> depending on the characterization of your documents).
>>>>>> You basically use the dot product of the square roots of the vectors
>>>>>> for LDA -- if you do a search for Hellinger or Bhattacharyya distance,
>>>>>> that will lead you to a good similarity or distance measure.
>>>>>> As I recall, Spark does provide an LDA implementation. Gensim provides
>>>>>> an API for doing LDA similarity out of the box. Vowpal Wabbit is also
>>>>>> worth looking at, particularly for a large dataset.
>>>>>> Hope this is useful.
>>>>>> Cheers
>>>>>> 
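The dot product of the square roots described above is the Bhattacharyya coefficient, and the Hellinger distance follows directly from it. A minimal sketch in plain Python, assuming each document is already represented as a topic distribution (non-negative weights summing to 1); the function names and sample vectors are made up for illustration:

```python
import math

def bhattacharyya_coefficient(p, q):
    """Dot product of the element-wise square roots of two distributions."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    return sum(math.sqrt(a * b) for a, b in zip(p, q))

def hellinger_distance(p, q):
    """Hellinger distance in [0, 1]; 0 means identical distributions."""
    bc = bhattacharyya_coefficient(p, q)
    # Clamp to guard against tiny floating-point overshoot above 1.
    return math.sqrt(max(0.0, 1.0 - bc))

# Hypothetical topic distributions for three documents (4 topics each).
doc_a = [0.70, 0.20, 0.05, 0.05]
doc_b = [0.65, 0.25, 0.05, 0.05]
doc_c = [0.05, 0.05, 0.20, 0.70]

print(hellinger_distance(doc_a, doc_b))  # smaller: similar topic mix
print(hellinger_distance(doc_a, doc_c))  # larger: different topic mix
```

The coefficient can be used directly as a similarity score (higher is closer), or the distance can be used wherever a metric is needed.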
>>>>>> Sent from my iPhone
>>>>>> 
>>>>>>> On Feb 14, 2016, at 8:14 AM, David Starina <david.starina@gmail.com> wrote:
>>>>>>> 
>>>>>>> Hi,
>>>>>>> 
>>>>>>> I need to build a system to determine the N (e.g. 10) most similar
>>>>>>> documents to a given document. I have some (theoretical) knowledge
>>>>>>> of Mahout algorithms, but not enough to build the system. Can you
>>>>>>> give me some suggestions?
>>>>>>> 
>>>>>>> At first I was researching Latent Semantic Analysis for the task, but
>>>>>>> since Mahout doesn't support it, I started researching some other
>>>>>>> options. I got a hint that instead of LSA, you can use LDA (Latent
>>>>>>> Dirichlet allocation) in Mahout to achieve similar and even better
>>>>>>> results.
>>>>>>> 
>>>>>>> However ... and this is where I got confused ... LDA is a clustering
>>>>>>> algorithm. What I need, though, is not to cluster the documents into
>>>>>>> N clusters - I need to get a matrix (similar to TF-IDF) from which I
>>>>>>> can calculate some sort of a distance for any two documents to get
>>>>>>> the N most similar documents for any given document.
>>>>>>> 
>>>>>>> How do I achieve that? My idea was (still mostly theoretical, since I
>>>>>>> have some problems with running the LDA algorithm) to extract some
>>>>>>> number of topics with LDA and then, rather than clustering the
>>>>>>> documents with the help of these topics, to get a matrix with
>>>>>>> documents as one dimension and topics as the other dimension. I was
>>>>>>> guessing I could then use this matrix as an input to the
>>>>>>> row-similarity algorithm.
>>>>>>> 
>>>>>>> Is this the correct concept? Or am I missing something?
>>>>>>> 
>>>>>>> And, since LDA is not supported on Spark/Samsara, how could I achieve
>>>>>>> similar results on Spark?
>>>>>>> 
>>>>>>> 
>>>>>>> Thanks in advance,
>>>>>>> David
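The documents-by-topics matrix idea in the original question can be sketched as follows. This is a brute-force illustration in plain Python, not Mahout's row-similarity job: it assumes the topic weights per document have already been produced by some LDA run, and `doc_topics`, `most_similar` and the sample values are hypothetical.

```python
import heapq
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def most_similar(doc_topics, query_id, n=10):
    """Return the n documents whose topic rows are closest to query_id's row."""
    query = doc_topics[query_id]
    scored = (
        (cosine_similarity(query, row), doc_id)
        for doc_id, row in doc_topics.items()
        if doc_id != query_id  # skip the query document itself
    )
    return [(doc_id, score) for score, doc_id in heapq.nlargest(n, scored)]

# Hypothetical matrix: one row per document, one column per topic.
doc_topics = {
    "doc1": [0.8, 0.1, 0.1],
    "doc2": [0.7, 0.2, 0.1],
    "doc3": [0.1, 0.1, 0.8],
    "doc4": [0.3, 0.6, 0.1],
}
print(most_similar(doc_topics, "doc1", n=2))
```

Each query scans every row, so this only suits small collections; for larger ones, an index such as the KNN/search engine approach discussed elsewhere in the thread is the more likely fit.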
>> 
>> 
