nutch-dev mailing list archives

From tittutomen <>
Subject Recrawl Strategy with Nutch!
Date Wed, 14 Oct 2009 10:58:58 GMT

We have crawled a million URLs and we want to continuously recrawl these
sites for updates.

The DFS cluster has 4 machines, with 1 Master and 4 Slaves. Crawling the
1 million sites took around 10 days.
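
For reference, here is a rough back-of-the-envelope estimate of our current
throughput. It is only a sketch: it assumes the 10 days were continuous
crawling time, which is an approximation, and the function name is just for
illustration.

```python
# Rough throughput estimate from the figures in this message.
# Assumption: the 10 days were spent crawling continuously.
URLS_CRAWLED = 1_000_000
CRAWL_DAYS = 10
SECONDS_PER_DAY = 24 * 60 * 60


def crawl_throughput(urls, days):
    """Return (URLs per day, URLs per second) for a crawl."""
    per_day = urls / days
    per_second = urls / (days * SECONDS_PER_DAY)
    return per_day, per_second


per_day, per_second = crawl_throughput(URLS_CRAWLED, CRAWL_DAYS)
print(f"{per_day:.0f} URLs/day, {per_second:.2f} URLs/s")
```

At this rate a full recrawl of everything would again take about 10 days,
which is why we need some way to prioritize the pages that change often.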


What recrawl strategy would let us pick up the updates quickly? How can we
optimize the Nutch recrawl script so that frequently changing sites are
recrawled sooner and the index is updated?

Could we somehow build the index incrementally from the crawl db?


Please suggest.
