nutch-dev mailing list archives

From Doug Cutting
Subject Re: Duplicate Detection: Offline vs. Search Time
Date Wed, 12 Apr 2006 23:42:47 GMT
Shailesh Kochhar wrote:
> I'm not very familiar with the Nutch API, though I know there's an MD5 
> signature based deduping method in place and a Signature class to extend 
> for offline duplicate detection. I was wondering if anyone had tried 
> search time deduping and what would be good places to try and implement it.

Nutch already does search-time deduping.  By default it limits results to 
two hits per host, but you can dedup on other fields and with other 
per-dup hit limits.  This is available through NutchBean, via a search 
method whose parameter list ends in (..., int, int, java.lang.String), 
and through the OpenSearch servlet.
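
To make the idea concrete, here is a minimal standalone sketch of limiting a ranked hit list to a fixed number of hits per dedup key.  It is not Nutch code and does not use the Nutch API: the class SearchTimeDedup, the methods dedup and hostOf, and the sample URLs are all hypothetical names chosen for illustration, and Nutch's own implementation may work differently.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Illustrative only: NOT the Nutch implementation, all names are hypothetical.
public class SearchTimeDedup {

    // Keep at most maxHitsPerDup hits for each distinct dedup key,
    // preserving the original ranking order of the hits.
    public static <T> List<T> dedup(List<T> rankedHits,
                                    Function<T, String> dedupKey,
                                    int maxHitsPerDup) {
        Map<String, Integer> seen = new HashMap<>();
        List<T> kept = new ArrayList<>();
        for (T hit : rankedHits) {
            // merge() increments the per-key counter and returns the new count.
            int count = seen.merge(dedupKey.apply(hit), 1, Integer::sum);
            if (count <= maxHitsPerDup) {
                kept.add(hit);
            }
        }
        return kept;
    }

    // Example dedup key: the host part of a URL string.
    public static String hostOf(String url) {
        return URI.create(url).getHost();
    }

    public static void main(String[] args) {
        List<String> hits = Arrays.asList(
            "http://a.example/1",
            "http://a.example/2",
            "http://b.example/1",
            "http://a.example/3");
        // Two hits per host, mirroring the default described above.
        for (String url : dedup(hits, SearchTimeDedup::hostOf, 2)) {
            System.out.println(url);
        }
    }
}
```

Deduping on a different field amounts to passing a different key function, and a different per-dup limit is just a different value for maxHitsPerDup.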

