lucene-java-user mailing list archives

From sdeck <>
Subject similarity and delete duplicates
Date Tue, 13 Feb 2007 17:50:46 GMT

Hey everyone.
I have been trying to get a certain kind of duplicate deletion working, but
I need a little help.

Here is my problem.
After a web crawl, I have many documents, and different sites can have
documents with similar titles. I want to remove all of those documents
except for one.
For example, I could have a list of titles like this:
1) George the Monkey won the bowl
2) The bowl was won by George the Monkey
3) Bowl won by George the Monkey

The way I do things now, I generate a query like this:
+title:George +title:Monkey +title:Bowl +title:won +title:the

and then run a search, which pulls back the matching documents.
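
For illustration, here is a minimal sketch of how such an all-terms-required
query string could be generated from a title. It is plain Java string
handling, not code from the poster. The class name TitleQueryBuilder and the
method build are hypothetical, and lowercasing the terms assumes the title
field is indexed with an analyzer that lowercases. Escaping of query-syntax
characters is left out.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;
import java.util.StringJoiner;

// Hypothetical helper: turns a title into "+field:term" clauses.
class TitleQueryBuilder {

    static String build(String field, String title) {
        // LinkedHashSet drops repeated words but keeps first-seen order.
        Set<String> terms = new LinkedHashSet<>();
        // Assumption: the index lowercases terms, so the query does too.
        for (String token : title.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                terms.add(token);
            }
        }
        // Each term gets a leading '+', which marks it as required.
        StringJoiner query = new StringJoiner(" ");
        for (String term : terms) {
            query.add("+" + field + ":" + term);
        }
        return query.toString();
    }

    public static void main(String[] args) {
        System.out.println(build("title", "George the Monkey won the bowl"));
        // prints: +title:george +title:the +title:monkey +title:won +title:bowl
    }
}
```

Because every term is required, this query matches all three example titles
above, but it would miss a near-duplicate that drops even one of the words.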
Now, my first, bad, way of deleting the dupes was to check for scores above
some threshold and delete those documents. However, as my index/crawler
(Nutch) kept generating, and I kept merging indexes, the scores kept getting
weird.
So, I found this forum item on similarity:

and wanted to know whether that is a good way of finding these duplicate
title matches, or whether someone else has a better idea for finding them.
The titles are not going to be exact matches, just fairly similar.
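
One option that does not depend on search scores is to compare the titles
directly. Lucene's relevance scores are computed from index-wide statistics
such as document frequencies, so they shift as indexes grow and merge, which
may explain the drifting scores described above. The sketch below uses
Jaccard similarity over lowercased word sets as a stand-in technique. It is
not the approach from the forum item mentioned above. The class name
TitleSimilarity is hypothetical, and any cut-off for "duplicate" would be an
assumption to tune on real data.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper: score-independent comparison of two titles.
class TitleSimilarity {

    // Lowercased set of words in the title (order and repeats ignored).
    static Set<String> tokens(String title) {
        Set<String> result = new HashSet<>();
        for (String token : title.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                result.add(token);
            }
        }
        return result;
    }

    // Jaccard similarity: |A intersect B| / |A union B|, in the range 0..1.
    static double jaccard(String a, String b) {
        Set<String> left = tokens(a);
        Set<String> right = tokens(b);
        Set<String> union = new HashSet<>(left);
        union.addAll(right);
        if (union.isEmpty()) {
            return 0.0; // two empty titles: treated as not similar
        }
        Set<String> intersection = new HashSet<>(left);
        intersection.retainAll(right);
        return (double) intersection.size() / union.size();
    }

    public static void main(String[] args) {
        String t1 = "George the Monkey won the bowl";
        String t2 = "The bowl was won by George the Monkey";
        String t3 = "Bowl won by George the Monkey";
        System.out.println(jaccard(t1, t2)); // 5 shared words of 7 distinct
        System.out.println(jaccard(t1, t3)); // 5 shared words of 6 distinct
    }
}
```

With the three example titles, the pairs score 5/7 and 5/6, so a cut-off
somewhere below that would group them together. The search could still be
used to fetch candidates, with this check deciding which ones to delete.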

Thanks for your help,
