nutch-dev mailing list archives

From Shailesh Kochhar <>
Subject Duplicate Detection: Offline vs. Search Time
Date Wed, 12 Apr 2006 22:06:18 GMT

I'm trying to implement a duplicate detection method that doesn't delete 
duplicate pages from the index. Essentially, I want to be able to 
display all the duplicate URLs for a page in the search results instead 
of just the one that was kept in the index.

There are at least two ways I can think of to implement this; rough sketches of both follow below.

1. Offline duplicate detection that deletes duplicate pages from the index 
but stores references to the deleted pages with the copy that is kept. 
The search results can then display all the URLs that share the same 
content (first sketch below).

2. Duplicate detection at search time that groups identical or similar 
pages together. This method has the advantage that duplicate detection 
could be made sensitive to the query terms; however, it adds a 
performance penalty to every search (second sketch below).
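
To make (1) concrete, here is roughly the bookkeeping I have in mind. 
It's plain Java, and all the class and field names are mine, not Nutch's:

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Group URLs by content signature, keep one canonical URL per
    // group, and remember the rest so the indexer can store them
    // with the surviving document (e.g. in a "duplicateUrls" field).
    public class OfflineDedup {

        // signature (e.g. hex MD5 of the content) -> URLs sharing it
        private final Map<String, List<String>> groups =
            new HashMap<String, List<String>>();

        public void add(String signature, String url) {
            List<String> urls = groups.get(signature);
            if (urls == null) {
                urls = new ArrayList<String>();
                groups.put(signature, urls);
            }
            urls.add(url);
        }

        // For each group with duplicates: element 0 of the list is the
        // copy kept in the index; the rest are the URLs that would be
        // deleted but recorded with the kept document.
        public Map<String, List<String>> duplicatesByCanonicalUrl() {
            Map<String, List<String>> result =
                new HashMap<String, List<String>>();
            for (List<String> urls : groups.values()) {
                if (urls.size() > 1) {
                    result.put(urls.get(0), urls.subList(1, urls.size()));
                }
            }
            return result;
        }
    }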
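
For (2), the grouping itself looks straightforward once each hit carries 
its content signature; the cost is fetching a signature for every hit 
plus an extra pass over the result list. Again a made-up sketch, not 
real Nutch classes:

    import java.util.ArrayList;
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;

    // Collapse a ranked hit list so that hits sharing a content
    // signature are grouped under their highest-ranked member.
    public class SearchTimeDedup {

        public static class Hit {
            final String url;
            final String signature;
            public Hit(String url, String signature) {
                this.url = url;
                this.signature = signature;
            }
        }

        // LinkedHashMap preserves rank order; the first URL in each
        // list is the best-ranked copy and would be the one displayed,
        // with the rest shown as alternate URLs.
        public static Map<String, List<String>> group(List<Hit> rankedHits) {
            Map<String, List<String>> groups =
                new LinkedHashMap<String, List<String>>();
            for (Hit hit : rankedHits) {
                List<String> urls = groups.get(hit.signature);
                if (urls == null) {
                    urls = new ArrayList<String>();
                    groups.put(hit.signature, urls);
                }
                urls.add(hit.url);
            }
            return groups;
        }
    }

A query-sensitive variant could group on a signature computed over each 
hit's snippet instead of over the whole page.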

I'm not very familiar with the Nutch API, though I know there's an MD5 
signature based deduping method in place and a Signature class to extend 
for offline duplicate detection. I was wondering if anyone has tried 
search-time deduping, and where in the code would be good places to 
implement it.
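
For reference, this is the kind of computation I assume the MD5-based 
signature performs: hash whitespace-normalized page text so that 
trivially different copies of a page still collide. A self-contained 
approximation in plain Java, not Nutch's actual code:

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;

    // Compute a hex MD5 signature over normalized page text.
    public class TextSignature {

        public static String signature(String pageText)
                throws NoSuchAlgorithmException {
            // Collapse whitespace and lowercase so that formatting
            // differences don't change the signature.
            String normalized =
                pageText.trim().replaceAll("\\s+", " ").toLowerCase();
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest =
                md5.digest(normalized.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        }
    }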

Any other suggestions/advice would be great.

   - Shailesh
