nutch-dev mailing list archives

From Doug Cutting <cutt...@nutch.org>
Subject incremental crawling
Date Thu, 01 Dec 2005 19:15:49 GMT
It would be good to improve Nutch's support for incremental crawling.  
Here are some ideas about how we might implement it.  Andrzej has posted 
about this in the past, so he probably has better ideas.

Incremental crawling could proceed as follows:

1. Bootstrap with a batch crawl, using the 'crawl' command.  Modify 
CrawlDatum to store the MD5Hash of the content of fetched urls.

2. Reduce the fetch interval for high-scoring urls.  If the default is 
monthly, then the top-scoring 1% of urls might be set to daily, and the 
top-scoring 10% of urls might be set to weekly.

3. Generate a fetch list & fetch it.  When a url has been previously 
fetched and its content is unchanged, increase its fetch interval by 
some amount, e.g., 50%.  If the content has changed, decrease the fetch 
interval.  The percentage of increase and decrease might be influenced 
by the url's score.

4. Update the crawl db & link db, index the new segment, dedup, etc. 
When updating the crawl db, scores for existing urls should not change, 
since the scoring method we're using (OPIC) assumes each page is fetched 
only once.
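
To make steps 1 and 2 concrete, here is a minimal Python sketch. It is not Nutch code (Nutch is written in Java): the `Datum` record, the function names, and the 1/7/30-day interval values are assumptions made for illustration. Only the idea of storing an MD5 hash and the 1% / 10% cut-offs come from the steps above.

```python
import hashlib
from dataclasses import dataclass

# Fetch intervals in days (illustrative values, not Nutch defaults).
DAILY, WEEKLY, MONTHLY = 1, 7, 30


@dataclass
class Datum:
    """Hypothetical stand-in for a CrawlDatum entry."""
    url: str
    score: float
    interval_days: int = MONTHLY
    signature: str = ""  # MD5 hex digest of the last fetched content


def content_signature(content: bytes) -> str:
    """Step 1: compute the MD5 hash of the fetched content."""
    return hashlib.md5(content).hexdigest()


def assign_intervals(data):
    """Step 2: top 1% by score -> daily, top 10% -> weekly, rest -> monthly."""
    ranked = sorted(data, key=lambda d: d.score, reverse=True)
    n = len(ranked)
    for rank, d in enumerate(ranked):
        if rank * 100 < n:      # within the top-scoring 1%
            d.interval_days = DAILY
        elif rank * 10 < n:     # within the top-scoring 10%
            d.interval_days = WEEKLY
        else:
            d.interval_days = MONTHLY
    return ranked
```

Ranking the whole crawl db by score is the simplest way to find the percentile cut-offs; on a large db one would more likely estimate the score thresholds from a sample.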

Steps 3 & 4 can be packaged as an 'update' command.  Step 2 can be 
included in the 'crawl' command, so that crawled indexes are always 
ready for update.
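
As a rough illustration of what the per-url part of such an 'update' pass might do, the Python sketch below combines the interval adjustment from step 3 with the rule from step 4 that existing scores are left alone. It is a sketch under assumptions, not Nutch's implementation: the `Datum` record, `update_after_fetch`, the 50% decrease, and the 1-day / 90-day bounds are all invented for the example. Only the 50% increase is taken from the text.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Datum:
    """Hypothetical stand-in for a CrawlDatum entry."""
    url: str
    score: float
    interval_days: float
    signature: str = ""  # empty means the url has not been fetched yet


def update_after_fetch(datum, content, grow=1.5, shrink=0.5,
                       min_days=1.0, max_days=90.0):
    """Adjust the fetch interval after a fetch; never touch the score."""
    new_sig = hashlib.md5(content).hexdigest()
    if datum.signature:  # only adapt if the url was fetched before
        if new_sig == datum.signature:
            datum.interval_days *= grow    # unchanged: fetch less often
        else:
            datum.interval_days *= shrink  # changed: fetch more often
        # Keep the interval within assumed bounds.
        datum.interval_days = min(max_days, max(min_days, datum.interval_days))
    datum.signature = new_sig
    # datum.score is deliberately left as-is (step 4).
    return datum


d = Datum(url="http://example.com/", score=1.0, interval_days=30.0)
update_after_fetch(d, b"version 1")  # first fetch: interval stays at 30 days
update_after_fetch(d, b"version 1")  # unchanged: 30 -> 45 days
update_after_fetch(d, b"version 2")  # changed:   45 -> 22.5 days
```

The `grow` and `shrink` factors could themselves be made a function of `datum.score`, as suggested in step 3.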

Comments?

Doug
