lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Shashi Kant <sk...@sloan.mit.edu>
Subject Re: How best to compare tow sentences
Date Wed, 03 Dec 2014 15:55:21 GMT
Paul, for a pair-wise comparison Cosine Similarity does pretty good
for most purposes.

On Wed, Dec 3, 2014 at 10:45 AM, Paul Taylor <paul_t100@fastmail.fm> wrote:
> On 03/12/2014 15:14, Barry Coughlan wrote:
>>
>> Hi Paul,
>>
>> I don't have much expertise in this area so hopefully others will answer,
>> but maybe this is better than nothing.
>>
>> I don't know many out-of-the-box solutions for this problem, but I'm sure
>> they exist. Mahout and Carrot2 might be worth investigating.
>>
>> Similarity Metrics:
>> - Jaccard Index. Measures similarity between two sets, so word order is
>> not important.
>> - Levenshtein distance. If you are dealing with user-inputted typos, the
>> Damerau–Levenshtein distance can perform a bit better because it takes into
>> account swapping adjacent letters (e.g. teh -> the).
>>
>> I worked with some code that did this for author names e.g. merge "Barack
>> Obama", "Obama B." and "B. H. Obama". It used a combination of
>> Damerau–Levenshtein distance and Jaccard index. It worked very well for this
>> problem, but unfortunately the code was sparse on documentation and full of
>> magic numbers so I don't know the details. The approach was similar to the
>> approach described in this answer:
>> http://stackoverflow.com/a/11920867/281469
>>
>> This is an O(n^2) pairwise comparison problem. As your data gets bigger
>> you have to work around this limitation. This problem is known in research
>> literature as the "all-pairs" similarity problem. The paper linked from this
>> repository is a good read on the subject:
>> https://code.google.com/p/google-all-pairs-similarity-search/
>>
>> One of the ways you can work around this is by using Lucene to limit the
>> amount of comparisons you need to do:
>> - Index your documents.
>> - For each song title do a fuzzy search of the words.
>> - Take the top N results, calculate their similarity with the song title
>> using the metrics (or just use the Lucene score).
>> - Cluster similar titles by song title.
>>
>> This is basically creating a sparse inverted index of your document
>> matrix, so that you can find results that will produce non-zero
>> similarities. This is the most effective way of optimizing performance that
>> I have encountered.
>>
>> Again, I'm sure there are out-of-the-box solutions for this problem, but I
>> don't know what they are.
>>
>> Hope that helps,
>> Barry
>>
> Thankyou barry I wil spend some time going through your suggestions, in the
> library Im looking at I dont seem to have Damerau–Levenshtein but I do have
> jaccardSimilarity so if that understands words Ill will give it a try.
>
> |BlockDistance
> ChapmanLengthDeviation
> ChapmanMatchingSoundex
> ChapmanMeanLength
> ChapmanOrderedNameCompoundSimilarity
> CosineSimilarity
> DiceSimilarity
> EuclideanDistance
> InterfaceStringMetric
> JaccardSimilarity
> Jaro
> JaroWinkler
> Levenshtein
> MatchingCoefficient
> MongeElkan
> NeedlemanWunch
> OverlapCoefficient
> QGramsDistance
> SmithWaterman
> SmithWatermanGotoh
> SmithWatermanGotohWindowedAffine
> Soundex
> |
> One things, regaridng your lucene based solution I think you have missed an
> important point. I am only comparing TWO strings at any time, I dont have a
> dataset of thousands of sentences that I want to compare over time I
> literally have string a and string b and I just want to compare those, at a
> later date Ill have string c and d, but at no point do I have strings
> a,b,c,d. I'm not trying to find the best  matching string for a single title
> just is this String a good match for this song title.
>
> Paul



-- 
skant@alum.mit.edu
(617) 595-5946

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message