nutch-dev mailing list archives

From Jack Tang
Subject Re: incremental crawling
Date Fri, 02 Dec 2005 02:12:12 GMT
Hi Doug

1. How should Nutch deal with "dead urls"? If I remove a url after Nutch's
first crawl, should Nutch keep the "dead urls" and never fetch them
2. Should Nutch export dedup as an extension point? In my project, we
add an information extraction layer to Nutch. I think it is a good idea
to export dedup as an extension point, since we can then base our
"duplicates rule" on the extracted data objects. Of course, the default is page


On 12/2/05, Doug Cutting wrote:
> It would be good to improve the support for incremental crawling added
> to Nutch.  Here are some ideas about how we might implement it.  Andrzej
> has posted in the past about this, so he probably has better ideas.
> Incremental crawling could proceed as follows:
> 1. Bootstrap with a batch crawl, using the 'crawl' command.  Modify
> CrawlDatum to store the MD5Hash of the content of fetched urls.
> 2. Reduce the fetch interval for high-scoring urls.  If the default is
> monthly, then the top-scoring 1% of urls might be set to daily, and the
> top-scoring 10% of urls might be set to weekly.
> 3. Generate a fetch list & fetch it.  When the url has been previously
> fetched, and its content is unchanged, increase its fetch interval by an
> amount, e.g., 50%.  If the content is changed, decrease the fetch
> interval.  The percentage of increase and decrease might be influenced
> by the url's score.
> 4. Update the crawl db & link db, index the new segment, dedup, etc.
> When updating the crawl db, scores for existing urls should not change,
> since the scoring method we're using (OPIC) assumes each page is fetched
> only once.
> Steps 3 & 4 can be packaged as an 'update' command.  Step 2 can be
> included in the 'crawl' command, so that crawled indexes are always
> ready for update.
> Comments?
> Doug
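
Steps 2 and 3 of the proposal quoted above can be sketched roughly as
follows. This is only an illustration in Python, not Nutch code: the
CrawlRecord class, both function names, the 0.5 shrink factor and the
1-day/90-day bounds are assumptions made for the example. Only the
monthly default, the 1%/10% tiers and the 50% increase come from the
message, and the suggested influence of the url's score on the
adjustment is not modelled.

```python
import hashlib
from dataclasses import dataclass
from typing import List, Optional

DEFAULT_INTERVAL_DAYS = 30.0  # the "monthly" default mentioned in the message


@dataclass
class CrawlRecord:
    """Hypothetical stand-in for a CrawlDatum that also stores a content hash."""
    url: str
    score: float
    fetch_interval_days: float = DEFAULT_INTERVAL_DAYS
    content_md5: Optional[str] = None  # None until the url has been fetched once


def assign_initial_intervals(records: List[CrawlRecord]) -> None:
    """Step 2: top-scoring 1% of urls become daily, top 10% weekly."""
    ranked = sorted(records, key=lambda r: r.score, reverse=True)
    n = len(ranked)
    for rank, rec in enumerate(ranked):
        if rank < n * 0.01:
            rec.fetch_interval_days = 1.0
        elif rank < n * 0.10:
            rec.fetch_interval_days = 7.0
        # everything else keeps its current (default) interval


def update_after_fetch(rec: CrawlRecord, content: bytes,
                       grow: float = 1.5,      # +50%, as in the message
                       shrink: float = 0.5,    # assumed; the message gives no value
                       min_days: float = 1.0,  # assumed lower bound
                       max_days: float = 90.0  # assumed upper bound
                       ) -> bool:
    """Step 3: adapt the interval by comparing MD5 hashes of the content.

    Returns True if the content differs from the previously stored hash.
    The first fetch only records the hash and leaves the interval alone.
    """
    new_md5 = hashlib.md5(content).hexdigest()
    changed = rec.content_md5 is not None and new_md5 != rec.content_md5
    if rec.content_md5 is not None:
        factor = shrink if changed else grow
        scaled = rec.fetch_interval_days * factor
        rec.fetch_interval_days = max(min_days, min(max_days, scaled))
    rec.content_md5 = new_md5
    return changed
```

In this sketch, assign_initial_intervals would run once after the
bootstrap crawl, and update_after_fetch once per url each time a
segment is fetched. The bounds keep a page that never changes from
drifting out of the crawl entirely.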

Keep Discovering ... ...
