lucene-java-user mailing list archives

From "Allison, Timothy B." <talli...@mitre.org>
Subject RE: Lucene Text Similarity
Date Wed, 04 Sep 2013 12:06:59 GMT
I agree with Ivan and Koji.  You also might want to look into MoreLikeThis, which should take
care of finding the highest tf*idf terms for you to use in your query -- http://lucene.apache.org/core/4_4_0/queries/org/apache/lucene/queries/mlt/MoreLikeThis.html
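
To make "highest tf*idf terms" concrete, here is a minimal plain-Java sketch of the idea. It is not Lucene's MoreLikeThis implementation: the class name, the tokenizer, and the smoothed idf formula are assumptions chosen for illustration, and the corpus is a plain list of strings rather than an index.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of Lucene.
final class TopTfIdfTerms {

    /** Lowercases the text and splits it on runs of non-letter, non-digit characters. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase().split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    /** Returns up to k terms of queryText, ordered by descending tf*idf against the corpus. */
    static List<String> topTerms(String queryText, List<String> corpus, int k) {
        // Document frequency: in how many corpus documents each term occurs.
        Map<String, Integer> df = new HashMap<>();
        for (String doc : corpus) {
            Set<String> unique = new HashSet<>(tokenize(doc));
            for (String t : unique) df.merge(t, 1, Integer::sum);
        }
        // Term frequency inside the (long) query text.
        Map<String, Integer> tf = new HashMap<>();
        for (String t : tokenize(queryText)) tf.merge(t, 1, Integer::sum);

        int n = corpus.size();
        Map<String, Double> score = new HashMap<>();
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            int d = df.getOrDefault(e.getKey(), 0);
            // Assumed smoothed idf; rarer terms get a larger weight.
            double idf = Math.log((n + 1.0) / (d + 1.0)) + 1.0;
            score.put(e.getKey(), e.getValue() * idf);
        }
        List<String> terms = new ArrayList<>(score.keySet());
        // Highest score first; ties broken alphabetically for a stable order.
        terms.sort((a, b) -> {
            int byScore = Double.compare(score.get(b), score.get(a));
            return byScore != 0 ? byScore : a.compareTo(b);
        });
        return new ArrayList<>(terms.subList(0, Math.min(k, terms.size())));
    }
}
```

The selected terms would then be used as the query instead of the full text, which is the part MoreLikeThis is meant to handle for you against a real index.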

Best,

       Tim

________________________________________
From: Ivan Krišto [ivan.kristo@gmail.com]
Sent: Wednesday, September 04, 2013 3:17 AM
To: java-user@lucene.apache.org
Subject: Re: Lucene Text Similarity

On 09/03/2013 07:33 PM, David Miranda wrote:

Is there any way to check the similarity of texts with Lucene? I have
DBpedia indexed and want to find the DBpedia abstracts that are most similar
to another text. If I search the abstract field with a particular text, the
result is not very satisfactory. E.g., DBpedia abstract: "SoundCloud is an
online audio distribution platform which allows collaboration, promotion and
distribution of audio recordings." My text: "Private Track From DJ Sneak.
Download the track now in the SoundCloud website."


You are attacking an extremely hard problem here -- searching short documents
with a long query. This creates a lot of problems, such as pushing the
document frequency of a term to the same magnitude as its own frequency,
which instantly kills some similarity measures.

All you can do is experiment a lot with different similarity measures
and preprocessing steps.

Similarity measures are simple: just try them all for each preprocessing
combination.
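
One way to organise "try them all for each preprocessing combination" is to enumerate every pairing up front and then evaluate each one. The sketch below only generates the configurations; the class name and the string labels are hypothetical placeholders, and wiring each label to an actual similarity or analysis step is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for planning experiments; the labels are placeholders.
final class ExperimentGrid {

    /** Pairs every similarity label with every subset of the preprocessing steps. */
    static List<String> configurations(List<String> similarities, List<String> steps) {
        List<String> configs = new ArrayList<>();
        int subsets = 1 << steps.size(); // 2^n on/off combinations of the steps
        for (String sim : similarities) {
            for (int mask = 0; mask < subsets; mask++) {
                List<String> enabled = new ArrayList<>();
                for (int i = 0; i < steps.size(); i++) {
                    // Bit i of the mask decides whether step i is switched on.
                    if ((mask & (1 << i)) != 0) enabled.add(steps.get(i));
                }
                configs.add(sim + " + " + enabled);
            }
        }
        return configs;
    }

    public static void main(String[] args) {
        List<String> sims = Arrays.asList("BM25", "TF-IDF");
        List<String> steps = Arrays.asList("stopwords", "sentences");
        // One line per experiment to run and compare.
        for (String config : configurations(sims, steps)) {
            System.out.println(config);
        }
    }
}
```

The number of runs grows as (number of measures) x 2^(number of steps), so with many preprocessing steps it may be worth pruning combinations that clearly do not help.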

Suggested preprocessing steps:
- remove all stop words
- remove all function words (you can find a list of them on Wikipedia)
- boost all uppercase words or words containing at least one uppercase
letter (add a boost of 3 or 4; maybe skip the first word of a sentence)
- break the search text into sentences, then search the index for each
sentence (combine the results using Borda count or something similar)
- do what Koji suggested
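
Three of the steps above (sentence splitting, uppercase boosting, and Borda count) can be sketched in plain Java as follows. This is an illustration under assumptions, not a tested recipe: the class and method names are made up, the sentence splitter is a naive regex, the boost is written in the classic query parser's `term^boost` form, and the Borda variant shown gives the top item of an n-item list n points and the last item 1 point.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper; names are illustrative only.
final class QueryPreprocessing {

    /** Naive sentence splitter: breaks after '.', '!' or '?' followed by whitespace. */
    static List<String> splitSentences(String text) {
        List<String> sentences = new ArrayList<>();
        for (String s : text.trim().split("(?<=[.!?])\\s+")) {
            if (!s.isEmpty()) sentences.add(s);
        }
        return sentences;
    }

    /** Appends ^boost to words containing an uppercase letter, skipping the first word. */
    static String boostUppercase(String sentence, int boost) {
        String[] tokens = sentence.trim().split("\\s+");
        StringBuilder query = new StringBuilder();
        for (int i = 0; i < tokens.length; i++) {
            // Drop punctuation so it is not read as query syntax.
            String word = tokens[i].replaceAll("[^\\p{L}\\p{N}]", "");
            if (word.isEmpty()) continue;
            if (query.length() > 0) query.append(' ');
            query.append(word);
            boolean hasUpper = word.chars().anyMatch(Character::isUpperCase);
            // The first word is capitalised anyway, so it carries no signal.
            if (i > 0 && hasUpper) query.append('^').append(boost);
        }
        return query.toString();
    }

    /** Merges ranked result lists (best first) using a simple Borda count. */
    static List<String> bordaCount(List<List<String>> rankings) {
        Map<String, Integer> points = new HashMap<>();
        for (List<String> ranking : rankings) {
            int n = ranking.size();
            for (int pos = 0; pos < n; pos++) {
                // Position 0 earns n points, the last position earns 1.
                points.merge(ranking.get(pos), n - pos, Integer::sum);
            }
        }
        List<String> merged = new ArrayList<>(points.keySet());
        // Most points first; ties broken alphabetically for a stable order.
        merged.sort((a, b) -> {
            int byPoints = Integer.compare(points.get(b), points.get(a));
            return byPoints != 0 ? byPoints : a.compareTo(b);
        });
        return merged;
    }
}
```

In use, each sentence from splitSentences would be passed through boostUppercase and searched separately, and the per-sentence result lists (for example, lists of document ids) would be merged with bordaCount. The boost has to be applied before any lowercasing analysis, since the case information is gone afterwards.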

  Regards,
    Ivan Krišto