nutch-dev mailing list archives

From Andrzej Bialecki
Subject Re: incremental crawling
Date Fri, 02 Dec 2005 09:15:18 GMT
Doug Cutting wrote:

> It would be good to improve the support for incremental crawling added 
> to Nutch.  Here are some ideas about how we might implement it.  
> Andrzej has posted in the past about this, so he probably has better 
> ideas.
> Incremental crawling could proceed as follows:
> 1. Bootstrap with a batch crawl, using the 'crawl' command.  Modify 
> CrawlDatum to store the MD5Hash of the content of fetched urls.

Yes, this is required to detect unmodified content. A small note: a plain 
MD5Hash(byte[] content) is quite ineffective for many pages, e.g. pages 
with a counter or with ads, where the bytes change on every fetch even 
though the page itself has not. It would be good to provide a framework 
for other implementations of "page equality" - for now perhaps we should 
just say that this value is a byte[], and not specifically an MD5Hash.
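
To illustrate what such a "page equality" framework might look like, here 
is a minimal sketch. All names in it (PageSignature, Md5Signature, 
DigitInsensitiveSignature, PageEquality) are hypothetical and not part of 
Nutch; the digit-stripping variant is only one possible way to ignore 
counters, and it assumes UTF-8 text content.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

/** Hypothetical plug-in point: reduces page content to an opaque byte[] signature. */
interface PageSignature {
    byte[] calculate(byte[] content);
}

/** Baseline: MD5 over the raw bytes, so any byte-level change gives a new signature. */
final class Md5Signature implements PageSignature {
    @Override
    public byte[] calculate(byte[] content) {
        try {
            return MessageDigest.getInstance("MD5").digest(content);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }
}

/** Illustrative alternative: drops digit runs (counters, dates) before hashing. */
final class DigitInsensitiveSignature implements PageSignature {
    private final PageSignature delegate = new Md5Signature();

    @Override
    public byte[] calculate(byte[] content) {
        // Assumption: the content is UTF-8 text.
        String text = new String(content, StandardCharsets.UTF_8);
        String normalized = text.replaceAll("[0-9]+", "");
        return delegate.calculate(normalized.getBytes(StandardCharsets.UTF_8));
    }
}

/** Two fetches count as "the same page" when their signatures match byte for byte. */
final class PageEquality {
    static boolean unchanged(byte[] oldSignature, byte[] newSignature) {
        return Arrays.equals(oldSignature, newSignature);
    }
}
```

Because the stored value is just a byte[], CrawlDatum would not need to 
know which implementation produced it.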

Other additions to CrawlDatum for consideration:

* last modified time, not just the last fetched time - these two are 
different, and the fetching policy will depend on both. E.g. to 
synchronize with the page change cycle it is necessary to know the time 
of the previous modification seen by Nutch. I've run simulations which 
show that if we don't track this value, the fetchInterval adjustments 
won't stabilize even if the page change cycle is fixed.

* segment name from the last updatedb. I'm not fully convinced about 
this, but consider the following:

I think this is needed in order to check which segments may be safely 
deleted, because there are no more active pages in them. If you enable a 
variable fetchInterval, then after a while you will end up with widely 
ranging intervals - some pages will have a daily or hourly period, some 
others will have a period of several months. Add to this the fact that 
you start counting the time for each page at different moments, and then 
the oldest page you have could be as old as maxFetchInterval (whatever 
that is, Float.MAX_VALUE or some other maximum you set). Most likely 
such old pages would live in segments with very little current data.

Now, you need to minimize the number of active segments (because of 
search performance and the time to deduplicate). However, with variable 
fetchInterval you no longer know which segments are safe to delete. I 
imagine a tool could collect all segment names from CrawlDB and prepare 
a list of (segmentName, numRecords) pairs. Segments that do not appear 
on this list would be safe to delete. Segments that hold only a few 
records could be processed to extract those records and move them to a 
single segment (discarding the rest of the old segment data).
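
A rough sketch of the bookkeeping such a tool would do, assuming each 
CrawlDB record carries the name of the segment from its last updatedb. 
The class SegmentUsage and its methods are hypothetical, and the sketch 
works on in-memory collections rather than on real CrawlDB or segment 
data.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical helper: decides which segments are unused or nearly empty. */
final class SegmentUsage {

    /** Builds the (segmentName, numRecords) list from one segment name per record. */
    static Map<String, Integer> countRecords(Iterable<String> segmentNamePerRecord) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String name : segmentNamePerRecord) {
            counts.merge(name, 1, Integer::sum);
        }
        return counts;
    }

    /** Segments on disk that no record references any more; returned in sorted order. */
    static List<String> deletable(Set<String> segmentsOnDisk, Map<String, Integer> counts) {
        List<String> result = new ArrayList<>();
        for (String segment : new TreeSet<>(segmentsOnDisk)) {
            if (!counts.containsKey(segment)) {
                result.add(segment);
            }
        }
        return result;
    }

    /** Segments holding fewer than minRecords active records: candidates for merging. */
    static List<String> sparse(Map<String, Integer> counts, int minRecords) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() < minRecords) {
                result.add(entry.getKey());
            }
        }
        return result;
    }
}
```

The minRecords threshold is an assumption; what counts as "few records" 
would have to be decided by whoever runs the tool.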


Alternatively, we could add Properties to CrawlDatum, and let people put 
whatever they wish there...

> 2. Reduce the fetch interval for high-scoring urls.  If the default is 
> monthly, then the top-scoring 1% of urls might be set to daily, and 
> the top-scoring 10% of urls might be set to weekly.

In the original patchset I had a notion of pluggable FetchSchedule-s. I 
think this would be an ideal place to make such decisions. 
Implementations would be pluggable in the same way as URLFilter, with 
the DefaultFetchSchedule doing what we do today.
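
As a sketch only, the interface could look something like the following. 
The method name initialInterval, the scoreRank parameter and the class 
ScoreTieredFetchSchedule are assumptions made for illustration, and are 
not taken from the patchset; the tiers (top 1% daily, top 10% weekly) 
come from the quoted proposal, and the default implementation is assumed 
to keep the configured interval for every url.

```java
/** Hypothetical shape of a pluggable fetch schedule. */
interface FetchSchedule {
    /**
     * @param defaultIntervalDays the configured default interval, in days
     * @param scoreRank fraction of urls that score higher (0.0 is the best url)
     * @return the fetch interval to assign, in days
     */
    float initialInterval(float defaultIntervalDays, double scoreRank);
}

/** Assumed current behaviour: every url gets the same fixed interval. */
final class DefaultFetchSchedule implements FetchSchedule {
    @Override
    public float initialInterval(float defaultIntervalDays, double scoreRank) {
        return defaultIntervalDays;
    }
}

/** Tiers from the quoted proposal: top 1% daily, top 10% weekly, rest default. */
final class ScoreTieredFetchSchedule implements FetchSchedule {
    @Override
    public float initialInterval(float defaultIntervalDays, double scoreRank) {
        if (scoreRank <= 0.01) {
            return 1f;
        }
        if (scoreRank <= 0.10) {
            return 7f;
        }
        return defaultIntervalDays;
    }
}
```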

> 3. Generate a fetch list & fetch it.  When the url has been previously 
> fetched, and its content is unchanged, increase its fetch interval by 
> an amount, e.g., 50%.  If the content is changed, decrease the fetch 
> interval.  The percentage of increase and decrease might be influenced 
> by the url's score.

Again, that's the task for a FetchSchedule.
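
For illustration, the adjustment rule from the quoted step could be as 
small as this. The class AdaptiveInterval is hypothetical; the 50% 
increase comes from the example above, while the 50% decrease and the 
clamping to a minimum and maximum interval are my assumptions. 
Score-dependent rates are left out.

```java
/** Hypothetical adjustment rule: grow the interval when unchanged, shrink when changed. */
final class AdaptiveInterval {
    /** Increase for unchanged content, taken from the "e.g., 50%" in the proposal. */
    static final float INC_RATE = 0.5f;
    /** Decrease for changed content; this value is an assumption. */
    static final float DEC_RATE = 0.5f;

    /**
     * @param intervalDays current fetch interval, in days
     * @param changed whether the content differed from the previous fetch
     * @param minDays lower bound for the result
     * @param maxDays upper bound for the result
     * @return the next fetch interval, in days, clamped to [minDays, maxDays]
     */
    static float adjust(float intervalDays, boolean changed, float minDays, float maxDays) {
        float next = changed
                ? intervalDays * (1f - DEC_RATE)
                : intervalDays * (1f + INC_RATE);
        return Math.max(minDays, Math.min(maxDays, next));
    }
}
```

With these rates a 30-day interval becomes 45 days after an unchanged 
fetch and 15 days after a changed one.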

> 4. Update the crawl db & link db, index the new segment, dedup, etc. 
> When updating the crawl db, scores for existing urls should not 
> change, since the scoring method we're using (OPIC) assumes each page 
> is fetched only once.

I would love to refactor this part too, to make the scoring mechanism 
abstracted in a similar way, so that you could plug in different scoring 
implementations. The float value in CrawlDatum is opaque enough to 
support different scoring mechanisms.
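
A sketch of what I mean, with hypothetical names throughout 
(ScoreUpdater, KeepExistingScoreUpdater, AccumulatingScoreUpdater). The 
first implementation is a simplified reading of the rule in the quoted 
step 4, not an implementation of OPIC itself; the second only shows that 
a different policy fits behind the same float.

```java
/** Hypothetical plug-in point for scoring; the score stays an opaque float. */
interface ScoreUpdater {
    /**
     * @param currentScore score stored in the crawl db (or the initial score for a new url)
     * @param knownUrl whether the url already existed in the crawl db before this update
     * @param inlinkContribution score passed on by newly seen inlinks
     * @return the score to store
     */
    float update(float currentScore, boolean knownUrl, float inlinkContribution);
}

/** Simplified rule from the quoted step 4: scores of existing urls do not change. */
final class KeepExistingScoreUpdater implements ScoreUpdater {
    @Override
    public float update(float currentScore, boolean knownUrl, float inlinkContribution) {
        return knownUrl ? currentScore : currentScore + inlinkContribution;
    }
}

/** A different policy behind the same interface: always accumulate contributions. */
final class AccumulatingScoreUpdater implements ScoreUpdater {
    @Override
    public float update(float currentScore, boolean knownUrl, float inlinkContribution) {
        return currentScore + inlinkContribution;
    }
}
```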

Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
Contact: info at sigram dot com
