nutch-dev mailing list archives

From Stefan Groschupf <...@media-style.com>
Subject Re: incremental crawling
Date Fri, 02 Dec 2005 09:48:13 GMT

Am 02.12.2005 um 10:15 schrieb Andrzej Bialecki:

> Yes, this is required to detect unmodified content. A small note:  
> plain MD5Hash(byte[] content) is quite ineffective for many pages,  
> e.g. pages with a counter, or with ads. It would be good to provide  
> a framework for other implementations of "page equality" - for now  
> perhaps we should just say that this value is a byte[], and not  
> specifically an MD5Hash.
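
To make that proposal concrete, here is a rough sketch of what such a
pluggable "page equality" framework could look like. All of the names
below are hypothetical and exist nowhere in the Nutch sources; the
point is only that the signature becomes an opaque byte[] with a
swappable implementation:

    /** Hypothetical hook for pluggable "page equality". An
     *  implementation returns an opaque byte[] signature; two pages
     *  with equal signatures are treated as unmodified/duplicates. */
    public interface ContentSignature {
      byte[] calculate(byte[] content, String contentType);
    }

    /** Default implementation: exact MD5 over the raw bytes, i.e.
     *  today's behaviour, just moved behind the interface. */
    class MD5Signature implements ContentSignature {
      public byte[] calculate(byte[] content, String contentType) {
        try {
          return java.security.MessageDigest.getInstance("MD5")
              .digest(content);
        } catch (java.security.NoSuchAlgorithmException e) {
          throw new RuntimeException("MD5 not available", e);
        }
      }
    }

A fuzzier implementation (like the LSH idea below) could then be
dropped in without touching the rest of the crawl code.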

Some time ago I found an interesting mechanism that may help us
here: it is called Locality-Sensitive Hashing (LSH).
From my point of view it would also be a perfect solution for
removing a lot of spam pages, since spam pages are often
near-duplicates of each other. On my todo list I have a task to
write a kind of proof of concept, but, like all of us, I was too
busy with other things.
You will find the paper behind the link below. I really would
love to see this in the Nutch sources, and I would offer to work
with others on such a solution.

http://dbpubs.stanford.edu:8090/pub/2000-23
or the pdf:
http://dbpubs.stanford.edu/pub/showDoc.Fulltext?lang=en&doc=2000-23&format=pdf&compression=&name=2000-23.pdf
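
For illustration, a toy minhash sketch of the LSH idea from that
paper. This is my own code, not anything from the Nutch sources;
the 64 hash functions and word 3-gram shingles are arbitrary
choices for the sketch:

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Random;
    import java.util.Set;

    /** Toy MinHash: pages with similar shingle sets get similar
     *  signatures, so near-duplicates (counter or ad changes) still
     *  match, unlike an exact MD5 over the raw bytes. */
    public class MinHashSignature {

      private static final int NUM_HASHES = 64;      // signature length
      private static final long PRIME = 4294967311L; // prime > 2^32
      private final long[] a = new long[NUM_HASHES];
      private final long[] b = new long[NUM_HASHES];

      public MinHashSignature(long seed) {
        Random rnd = new Random(seed);
        for (int i = 0; i < NUM_HASHES; i++) {
          a[i] = 1 + (long) (rnd.nextDouble() * (PRIME - 1));
          b[i] = (long) (rnd.nextDouble() * PRIME);
        }
      }

      /** Break text into overlapping word 3-grams ("shingles"). */
      private static Set<Integer> shingles(String text) {
        String[] w = text.toLowerCase().split("\\s+");
        Set<Integer> result = new HashSet<Integer>();
        for (int i = 0; i + 3 <= w.length; i++) {
          result.add((w[i] + " " + w[i + 1] + " " + w[i + 2]).hashCode());
        }
        return result;
      }

      /** Signature = per hash function, the minimum over all shingles. */
      public long[] signature(String text) {
        long[] sig = new long[NUM_HASHES];
        Arrays.fill(sig, Long.MAX_VALUE);
        for (int s : shingles(text)) {
          long x = s & 0xffffffffL; // treat as unsigned 32-bit
          for (int i = 0; i < NUM_HASHES; i++) {
            long h = (a[i] * x + b[i]) % PRIME;
            if (h < sig[i]) sig[i] = h;
          }
        }
        return sig;
      }

      /** Fraction of matching positions estimates Jaccard similarity. */
      public static double similarity(long[] s1, long[] s2) {
        int match = 0;
        for (int i = 0; i < s1.length; i++) {
          if (s1[i] == s2[i]) match++;
        }
        return (double) match / s1.length;
      }
    }

Two fetches of a page that differ only in a counter or an ad block
share almost all shingles, so the estimated similarity stays close
to 1.0, while an MD5 over the raw bytes would already differ.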

Greetings,
Stefan