lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Barry Coughlan <b.coughl...@gmail.com>
Subject Re: How best to compare tow sentences
Date Wed, 03 Dec 2014 16:48:54 GMT
There are various implementations of Damerau-Levenshtein online. I don't
know how much it will improve your results however.

Why are you not indexing all of the strings? If you don't have to compute
all possible pairs, then you are better off without Lucene.

Note that the cosine similarity calculation that Lucene performs is based
on TF-IDF values (documented here:
http://lucene.apache.org/core/4_10_2/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html).
For very short strings TF-IDF is very noisy. It doesn't make a whole lot of
sense when you only have 2 documents. I suspect that the simmetrics
CosineSimilarity and JaccardSimilarity will give you better results..

My suggestion is to first correct for typos by calculating the Levenshtein
Distance for each word in songtitle1 against each word in songtitle2, and
assume that words are the same if the distance is less than 2. Then use
JaccardSimilarity or CosineSimilarity to calculate the similarity.

Barry



On Wed, Dec 3, 2014 at 3:45 PM, Paul Taylor <paul_t100@fastmail.fm> wrote:

> On 03/12/2014 15:14, Barry Coughlan wrote:
>
>> Hi Paul,
>>
>> I don't have much expertise in this area so hopefully others will answer,
>> but maybe this is better than nothing.
>>
>> I don't know many out-of-the-box solutions for this problem, but I'm sure
>> they exist. Mahout and Carrot2 might be worth investigating.
>>
>> Similarity Metrics:
>> - Jaccard Index. Measures similarity between two sets, so word order is
>> not important.
>> - Levenshtein distance. If you are dealing with user-inputted typos, the
>> Damerau–Levenshtein distance can perform a bit better because it takes into
>> account swapping adjacent letters (e.g. teh -> the).
>>
>> I worked with some code that did this for author names e.g. merge "Barack
>> Obama", "Obama B." and "B. H. Obama". It used a combination of
>> Damerau–Levenshtein distance and Jaccard index. It worked very well for
>> this problem, but unfortunately the code was sparse on documentation and
>> full of magic numbers so I don't know the details. The approach was similar
>> to the approach described in this answer: http://stackoverflow.com/a/
>> 11920867/281469
>>
>> This is an O(n^2) pairwise comparison problem. As your data gets bigger
>> you have to work around this limitation. This problem is known in research
>> literature as the "all-pairs" similarity problem. The paper linked from
>> this repository is a good read on the subject: https://code.google.com/p/
>> google-all-pairs-similarity-search/
>>
>> One of the ways you can work around this is by using Lucene to limit the
>> amount of comparisons you need to do:
>> - Index your documents.
>> - For each song title do a fuzzy search of the words.
>> - Take the top N results, calculate their similarity with the song title
>> using the metrics (or just use the Lucene score).
>> - Cluster similar titles by song title.
>>
>> This is basically creating a sparse inverted index of your document
>> matrix, so that you can find results that will produce non-zero
>> similarities. This is the most effective way of optimizing performance that
>> I have encountered.
>>
>> Again, I'm sure there are out-of-the-box solutions for this problem, but
>> I don't know what they are.
>>
>> Hope that helps,
>> Barry
>>
>>  Thankyou barry I wil spend some time going through your suggestions, in
> the library Im looking at I dont seem to have Damerau–Levenshtein but I do
> have jaccardSimilarity so if that understands words Ill will give it a try.
>
> |BlockDistance
> ChapmanLengthDeviation
> ChapmanMatchingSoundex
> ChapmanMeanLength
> ChapmanOrderedNameCompoundSimilarity
> CosineSimilarity
> DiceSimilarity
> EuclideanDistance
> InterfaceStringMetric
> JaccardSimilarity
> Jaro
> JaroWinkler
> Levenshtein
> MatchingCoefficient
> MongeElkan
> NeedlemanWunch
> OverlapCoefficient
> QGramsDistance
> SmithWaterman
> SmithWatermanGotoh
> SmithWatermanGotohWindowedAffine
> Soundex
> |
> One things, regaridng your lucene based solution I think you have missed
> an important point. I am only comparing TWO strings at any time, I dont
> have a dataset of thousands of sentences that I want to compare over time I
> literally have string a and string b and I just want to compare those, at a
> later date Ill have string c and d, but at no point do I have strings
> a,b,c,d. I'm not trying to find the best  matching string for a single
> title just is this String a good match for this song title.
>
> Paul
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message