lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Beto Siless <>
Subject Re: near duplicates
Date Tue, 24 Oct 2006 14:23:23 GMT
Hi Andrej!

I'm taking a look to fuzzy signatures for near duplicate detection and 
and I have seen your TextProfileSignature. The question is: If I index 
the documents with their text signature, is there a way to filter near 
duplicates at search time without comparing each document with all other?


Andrzej Bialecki wrote:
> karl wettin wrote:
>> 17 okt 2006 kl. 17.54 skrev Find Me:
>>> How to eliminate near duplicates from the index?
>> I would probably try to measure the Ecludian distance between all 
>> documents, computed on terms and their positions. Or perhaps use 
>> standard deviation to find the distribution of terms in a document. 
>> One would based on the output from that try to find a threashold. 
>> Either way it will consume lots of CPU.
> There are better ways to achieve this. You need to create a fuzzy 
> signature of the document, based on term histogram or shingles - take a 
> look a the Signature framework in Nutch.
> There is a substantial literature on this subject - go to Citeseer and 
> run a search for "near duplicate detection".

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message