lucene-java-user mailing list archives

From: Grant Ingersoll
Subject: Re: Need Help: Business Scenario to lucene implementation
Date: Thu, 01 Sep 2011 13:14:07 GMT
I'd probably treat this as a deduplication problem and look to use a fuzzy matching approach,
such as the TextProfileSignature in Solr/Nutch,
which I believe is tunable as to its threshold of acceptance.
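
To make the fuzzy-signature idea concrete, here is a minimal sketch in plain Java. It is
loosely modeled on the general approach (count tokens, round the counts down to coarse
buckets, drop the rare tokens, hash what is left) and is not the Solr or Nutch code. The
class name, the method names, and the two constants are assumptions made up for this
illustration.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: a simplified fuzzy signature, not the Solr/Nutch implementation.
class FuzzySignatureSketch {

    // Hypothetical tuning knobs, chosen for this example.
    static final int MIN_TOKEN_LENGTH = 2;
    static final double QUANT_RATE = 0.01;

    // Builds the "profile": frequent tokens with their counts rounded down to a bucket.
    static List<String> profile(String text) {
        // TreeMap keeps tokens in alphabetical order, which makes the result deterministic.
        Map<String, Integer> counts = new TreeMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (token.length() >= MIN_TOKEN_LENGTH) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        int maxFreq = 0;
        for (int count : counts.values()) {
            maxFreq = Math.max(maxFreq, count);
        }
        // Bucket size grows with the most frequent token; never smaller than 1.
        int quant = (int) Math.round(maxFreq * QUANT_RATE);
        if (quant < 2) {
            quant = (maxFreq > 1) ? 2 : 1;
        }
        List<Map.Entry<String, Integer>> kept = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            int bucketed = (e.getValue() / quant) * quant;
            if (bucketed >= quant) { // tokens rarer than one bucket are ignored
                kept.add(new AbstractMap.SimpleEntry<>(e.getKey(), bucketed));
            }
        }
        // Stable sort by descending count; ties stay in alphabetical order.
        kept.sort((x, y) -> Integer.compare(y.getValue(), x.getValue()));
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, Integer> e : kept) {
            lines.add(e.getKey() + " " + e.getValue());
        }
        return lines;
    }

    // Hashes the profile, so documents with the same profile get the same signature.
    static String signature(String text) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(
                String.join("\n", profile(text)).getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is not available", e);
        }
    }

    public static void main(String[] args) {
        String a = "the quick brown fox jumps over the lazy dog the fox";
        String b = "the quick brown fox leaps over the lazy dog the fox";
        System.out.println(signature(a).equals(signature(b)));
    }
}
```

Because rare tokens are dropped before hashing, two texts that differ only in a few
infrequent words end up with the same signature. Note that this gives a yes/no answer
rather than a percentage, and the bucket size is what controls how tolerant it is.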

I'd also likely push back on the notion of 50% and ask for a bit more clarification.  Does it
mean 50% of all words (pre- or post-analysis?  Stemmed or not?) or 50% of the "important" words
(which is more or less what More Like This will do)?  You might also do a little bit of research
into the academic literature here, as a fair amount of work has gone into this area along the
lines of detecting plagiarism, etc.  Finally, one might instead be able to treat this as
a classification problem and train a model to decide whether a document is a dupe or not.
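
As a small illustration of why the definition of "50%" matters, the sketch below computes
the share of a new document's distinct terms that also appear in an existing document,
once over all words and once with common words dropped. This is plain Java and does not
use Lucene analysis or More Like This; the class name, the method names, and the tiny
stop word list are assumptions made up for this example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Illustrative only: term overlap under two different definitions of "words".
class OverlapSketch {

    // Hypothetical stop word list, far smaller than anything a real analyzer would use.
    static final Set<String> STOP_WORDS = new HashSet<>(
        Arrays.asList("a", "an", "and", "in", "is", "of", "on", "the", "to"));

    // Splits text into distinct lowercase terms, optionally dropping stop words.
    static Set<String> terms(String text, boolean dropStopWords) {
        Set<String> result = new HashSet<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (token.isEmpty()) {
                continue;
            }
            if (dropStopWords && STOP_WORDS.contains(token)) {
                continue;
            }
            result.add(token);
        }
        return result;
    }

    // Fraction of the new document's distinct terms that also occur in the existing one.
    static double containment(Set<String> newDoc, Set<String> existing) {
        if (newDoc.isEmpty()) {
            return 0.0;
        }
        Set<String> shared = new HashSet<>(newDoc);
        shared.retainAll(existing);
        return (double) shared.size() / newDoc.size();
    }

    public static void main(String[] args) {
        String newDoc = "the cat sat on the mat";
        String existing = "the dog sat on the rug";
        System.out.println(containment(terms(newDoc, false), terms(existing, false)));
        System.out.println(containment(terms(newDoc, true), terms(existing, true)));
    }
}
```

For this pair of sentences the overlap is 3 of 5 distinct terms when every word counts, but
only 1 of 3 once the stop words are removed, so the same two documents fall on opposite
sides of a 50% cutoff depending on which definition is chosen.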

On Aug 30, 2011, at 12:55 PM, Saurabh Gokhale wrote:

> Hi All,
> I need your help to understand how I can have Lucene applied to the
> following business scenario. Question is in RED
> *Business Scenario:*
> Analyze newly created document "A" with existing documents in the system and
> if document A matches more than (similar to) 50% with any of the existing
> documents, perform specific action.
> *Possible Lucene Implementation:*
> Requirement: Analyze newly created document A
> Action: Read name and the contents of the document A
> Requirement: Analyze new document with existing documents in the system
> Action: 1. Pre Index all the existing document and create lucene index. 2.
> Use class like MoreLikeThis to find similar documents for newly created
> document.
> Requirement: If match is above 50%, perform specific action
> Action: Since resulting lucene score for the match can not be directly
> converted into a percentage match (as the score value changes based on many
> factors) how can this requirement be satisfied?
> Thanks
> Saurabh

Grant Ingersoll
Lucene Eurocon 2011:
