nutch-dev mailing list archives

From Stefan Groschupf
Subject Re: incremental crawling
Date Fri, 02 Dec 2005 09:48:13 GMT

On 02.12.2005 at 10:15, Andrzej Bialecki wrote:

> Yes, this is required to detect unmodified content. A small note:  
> plain MD5Hash(byte[] content) is quite ineffective for many pages,  
> e.g. pages with a counter, or with ads. It would be good to provide  
> a framework for other implementations of "page equality" - for now  
> perhaps we should just say that this value is a byte[], and not  
> specifically an MD5Hash.
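
The framework idea above could be sketched as a small interface whose implementations return opaque `byte[]` signatures. This is only an illustration of the suggestion, not Nutch code; the names `PageSignature` and `Md5Signature` are my own. It also demonstrates the weakness being described: a page that differs only in a counter gets a completely different MD5 signature.

```java
// Hypothetical sketch of a pluggable "page equality" framework: signatures
// are opaque byte[] values, not necessarily MD5 hashes. Names are invented
// for illustration, not taken from the Nutch sources.
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

interface PageSignature {
    byte[] compute(byte[] content);
}

// Baseline implementation: plain MD5 over the raw bytes.
class Md5Signature implements PageSignature {
    public byte[] compute(byte[] content) {
        try {
            return MessageDigest.getInstance("MD5").digest(content);
        } catch (NoSuchAlgorithmException e) {
            throw new RuntimeException(e);
        }
    }
}

public class SignatureDemo {
    public static void main(String[] args) {
        PageSignature sig = new Md5Signature();
        byte[] a = sig.compute("<html>hit counter: 41</html>".getBytes());
        byte[] b = sig.compute("<html>hit counter: 42</html>".getBytes());
        // Pages identical except for a counter still get different MD5
        // signatures -- exactly the problem described above.
        System.out.println("equal = " + Arrays.equals(a, b));
    }
}
```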

Some time ago I found an interesting mechanism that might help us here,
called Locality-Sensitive Hashing (LSH).
From my point of view this would also be a perfect solution for removing
a lot of spam pages. I have a task on my todo list to write a kind of
proof of concept, but like all of us I was too busy with other things.
You will find the paper behind the link below. I would really love to
see this in the nutch sources, and I would offer to work with others
on such a solution.
or the pdf: 
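
One well-known LSH scheme for text is a SimHash-style fingerprint: near-duplicate pages map to 64-bit values that differ in only a few bit positions, so a changed counter or ad no longer breaks equality detection. The sketch below is my own minimal illustration of the technique (token hashing via FNV-1a is an arbitrary choice), not a proposal for the actual Nutch implementation.

```java
// Minimal SimHash-style locality-sensitive fingerprint. Near-duplicate
// documents produce fingerprints with a small Hamming distance; unrelated
// documents land far apart. Illustration only, not Nutch code.
public class SimHashDemo {

    // 64-bit SimHash over whitespace-separated tokens: each token casts a
    // +1/-1 vote per bit position, and the fingerprint keeps the sign.
    static long simhash(String text) {
        int[] votes = new int[64];
        for (String token : text.toLowerCase().split("\\s+")) {
            long h = fnv1a64(token);
            for (int i = 0; i < 64; i++) {
                votes[i] += ((h >>> i) & 1L) == 1L ? 1 : -1;
            }
        }
        long fingerprint = 0L;
        for (int i = 0; i < 64; i++) {
            if (votes[i] > 0) fingerprint |= 1L << i;
        }
        return fingerprint;
    }

    // Simple 64-bit FNV-1a hash for tokens (an arbitrary choice here).
    static long fnv1a64(String s) {
        long h = 0xcbf29ce484222325L;
        for (int i = 0; i < s.length(); i++) {
            h ^= s.charAt(i);
            h *= 0x100000001b3L;
        }
        return h;
    }

    static int hammingDistance(long a, long b) {
        return Long.bitCount(a ^ b);
    }

    public static void main(String[] args) {
        long a = simhash("nutch crawls the web and indexes pages visitors 41");
        long b = simhash("nutch crawls the web and indexes pages visitors 42");
        long c = simhash("completely unrelated spam text buy cheap pills now");
        int near = hammingDistance(a, b);
        int far = hammingDistance(a, c);
        // Near-duplicates should be much closer in Hamming space than
        // unrelated text.
        System.out.println("near < far: " + (near < far));
    }
}
```

A crawler could then treat two fetches as "unmodified" when the Hamming distance of their fingerprints falls below some small threshold, instead of requiring an exact hash match.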
