nutch-dev mailing list archives

From Andrzej Bialecki
Subject Re: refetching interval
Date Tue, 16 May 2006 21:18:20 GMT
Ledio Ago wrote:
> Hi Michael! Did you get an answer on this one?  It seems like the refetch interval
> is hardcoded, no matter what you set it to in the config file, since FETCH_GENERATION_DELAY_MS
> takes effect after the first fetch.
> Anybody out there, is this correct, or are we reading this wrong?  If this is correct
> then the refetching feature doesn't work.

This is not the case (i.e. you are reading this wrong :) ). The 
FETCH_GENERATION_DELAY_MS constant specifies how much time must pass 
before pages that were already selected for a fetchlist are 
reconsidered for selection, UNLESS they have been updated with 
updatedb (after fetching).

This prevents the same pages from being selected if you run FetchListTool 
twice in rapid succession, while ensuring that you do not wait 
indefinitely if you lost or discarded that fetchlist. 7 days was 
considered a good compromise (some large fetch jobs may run for days, 
so it could be a couple of days before you have a chance to run updatedb 
with the results of fetching).
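
To illustrate the rule described above, here is a minimal Java sketch. It is
not the actual FetchListTool code: the PageState class, its nextFetchTime
field, and the select/update methods are hypothetical names made up for this
example, and modelling the delay as a pushed-back "next fetch" timestamp is an
assumption. The 30-day refetch interval is likewise only an example value.

```java
import java.util.concurrent.TimeUnit;

final class GenerationDelaySketch {
    // Seven days, as stated in the message; the name mirrors the constant discussed there.
    static final long FETCH_GENERATION_DELAY_MS = TimeUnit.DAYS.toMillis(7);

    // Hypothetical stand-in for a page record (NOT the real Nutch page class).
    static final class PageState {
        long nextFetchTime; // page may be selected once now >= nextFetchTime

        PageState(long nextFetchTime) {
            this.nextFetchTime = nextFetchTime;
        }
    }

    static boolean isEligible(PageState page, long now) {
        return now >= page.nextFetchTime;
    }

    // Fetchlist generation: pick the page if eligible, then hold it back for
    // the generation delay so that a second run does not pick it again.
    static boolean select(PageState page, long now) {
        if (!isEligible(page, now)) {
            return false;
        }
        page.nextFetchTime = now + FETCH_GENERATION_DELAY_MS;
        return true;
    }

    // updatedb after fetching: the configured refetch interval replaces the
    // temporary generation delay.
    static void update(PageState page, long fetchTime, long refetchIntervalMs) {
        page.nextFetchTime = fetchTime + refetchIntervalMs;
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        long day = TimeUnit.DAYS.toMillis(1);
        PageState page = new PageState(now);

        System.out.println("first generate:  " + select(page, now));
        System.out.println("second generate: " + select(page, now + day));

        // Fetch finished two days later; updatedb applies a 30-day interval.
        update(page, now + 2 * day, 30 * day);
        System.out.println("eligible 33 days in: " + isEligible(page, now + 33 * day));
    }
}
```

In this model the 7-day value only matters for pages that were put on a
fetchlist and never passed through updatedb; once update() runs, the refetch
interval alone decides when the page becomes eligible again.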

Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration  Contact: info at sigram dot com
