lucene-java-user mailing list archives

From "Allison, Timothy B." <talli...@mitre.org>
Subject RE: Lucene Text Similarity
Date Wed, 04 Sep 2013 18:41:30 GMT
If MoreLikeThis doesn't work, you might want to look into Wikipedia Miner:  http://www.nzdl.org/wikification/about.html

http://www.wikipedia-miner.sourceforge.net/

or other wikifiers.

Best,

      Tim

________________________________________
From: David Miranda [david.b.miranda@gmail.com]
Sent: Wednesday, September 04, 2013 1:45 PM
To: java-user@lucene.apache.org
Subject: Re: Lucene Text Similarity

Thanks to all, I will take into account your suggestions.

But I think I should have given the concrete use case. Going back to
my first example: I have an email received by a user, and from that
email I extract topics of interest to associate with DBpedia terms
(basically DBpedia documents). The problem is that a term such as
Apple may be a fruit or a company (Apple Computers). To resolve this
ambiguity, I wanted to compare each candidate's abstract with the text
of the email to find out which term is the best one to choose.
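
A minimal sketch of that abstract-vs-email comparison, assuming plain term-count cosine similarity rather than Lucene's own scoring. The class name, method names, stop list and the two candidate abstracts are all made up for illustration; they are not real DBpedia text or part of any Lucene API.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: picks the candidate whose abstract best matches the email.
class AbstractDisambiguator {

    // Tiny illustrative stop list (assumption); a real one would be much longer.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
            "a", "an", "and", "from", "in", "is", "of", "or",
            "that", "the", "to", "which", "with"));

    /** Lower-cases the text, splits on non-letters and counts the remaining terms. */
    static Map<String, Integer> termCounts(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : text.toLowerCase().split("[^\\p{L}]+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Cosine similarity between two term-count vectors; 0 if either is empty. */
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        for (int v : b.values()) {
            normB += v * v;
        }
        return (normA == 0 || normB == 0) ? 0 : dot / Math.sqrt(normA * normB);
    }

    /** Returns the candidate name whose abstract is most similar to the email text. */
    static String best(String email, Map<String, String> abstractsByCandidate) {
        Map<String, Integer> emailCounts = termCounts(email);
        String bestName = null;
        double bestScore = -1;
        for (Map.Entry<String, String> e : abstractsByCandidate.entrySet()) {
            double score = cosine(emailCounts, termCounts(e.getValue()));
            if (score > bestScore) {
                bestScore = score;
                bestName = e.getKey();
            }
        }
        return bestName;
    }

    public static void main(String[] args) {
        // Placeholder abstracts, invented for this example.
        Map<String, String> candidates = new LinkedHashMap<>();
        candidates.put("Apple (fruit)",
                "The apple is the edible fruit of the apple tree.");
        candidates.put("Apple Inc.",
                "Apple is a company that designs computers, phones and software.");
        String email = "The new Apple computers and phones ship with updated software.";
        System.out.println(best(email, candidates)); // prints "Apple Inc."
    }
}
```

With abstracts this short, a single shared word can swing the result, so this is probably only a baseline to compare other similarity measures against.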

Thanks.

2013/9/4 Allison, Timothy B. <tallison@mitre.org>:
> I agree with Ivan and Koji.  You also might want to look into MoreLikeThis, which should
> take care of finding the highest tf*idf terms for you to use in your query -- http://lucene.apache.org/core/4_4_0/queries/org/apache/lucene/queries/mlt/MoreLikeThis.html
>
> Best,
>
>        Tim
>
> ________________________________________
> From: Ivan Krišto [ivan.kristo@gmail.com]
> Sent: Wednesday, September 04, 2013 3:17 AM
> To: java-user@lucene.apache.org
> Subject: Re: Lucene Text Similarity
>
> On 09/03/2013 07:33 PM, David Miranda wrote:
>
> Is there any way to check the similarity of texts with Lucene? I have
> DBpedia indexed and want to find the abstracts most similar to another
> text. If I search the abstract field with a particular text, the results
> are not very satisfactory. E.g. DBpedia abstract: "SoundCloud is an online
> audio distribution platform which allows collaboration, promotion and
> distribution of audio recordings." My text: "Private Track From DJ Sneak.
> Download the track now in the SoundCloud website."
>
>
> You are attacking an extremely hard problem here -- searching short
> documents with a long query. This creates a lot of problems, such as a
> term's document frequency ending up at the same magnitude as its term
> frequency, which instantly kills some similarity measures.
>
> All you can do is experiment a lot with different similarity measures
> and preprocessing steps.
>
> Similarity measures are simple to swap, so just try them all for each
> preprocessing combination.
>
> Suggested preprocessing steps:
> - remove all stop words
> - remove all function words (you can find a list of them on Wikipedia)
> - boost all uppercase words or words containing at least one uppercase
> letter (add a boost of 3 or 4; maybe skip the first word of a sentence)
> - break the search text into sentences, then search the index for each
> sentence (combine the results using Borda count or something similar)
> - do what Koji suggested
>
>   Regards,
>     Ivan Krišto
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
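
Ivan's suggestion of searching once per sentence and merging the hits with a Borda count can be sketched roughly as follows. This is only one common variant of Borda counting (a hit at position p in a list of n results earns n - p points), and the class name, method name and document ids are hypothetical; the per-sentence rankings would come from your own Lucene searches.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: merges several ranked result lists into one ranking.
class BordaCount {

    /**
     * Each input list is ranked best hit first. A document at position p in a
     * list of length n earns n - p points; documents missing from a list earn
     * nothing from it. The merged list is ordered by total points, descending.
     */
    static List<String> combine(List<List<String>> rankings) {
        Map<String, Integer> points = new HashMap<>();
        for (List<String> ranking : rankings) {
            int n = ranking.size();
            for (int p = 0; p < n; p++) {
                points.merge(ranking.get(p), n - p, Integer::sum);
            }
        }
        List<String> merged = new ArrayList<>(points.keySet());
        // Highest total first; ties broken alphabetically so the output is stable.
        merged.sort((a, b) -> {
            int byPoints = Integer.compare(points.get(b), points.get(a));
            return byPoints != 0 ? byPoints : a.compareTo(b);
        });
        return merged;
    }

    public static void main(String[] args) {
        // Made-up per-sentence results: one ranked list of document ids per sentence.
        List<List<String>> perSentence = Arrays.asList(
                Arrays.asList("SoundCloud", "Disc_jockey", "Download"),
                Arrays.asList("SoundCloud", "Download"),
                Arrays.asList("Disc_jockey", "SoundCloud"));
        System.out.println(combine(perSentence));
        // prints [SoundCloud, Disc_jockey, Download]
    }
}
```

Because only ranks are used, the raw scores of the individual sentence queries never need to be comparable with each other.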



--
Regards,
David Miranda
