nutch-dev mailing list archives

From Ferdy Galema <ferdy.gal...@kalooga.com>
Subject Re: A FetchSchedule bug makes fetch time becoming more and more big
Date Wed, 15 Aug 2012 12:24:21 GMT
Hi,

Yeah, this is something I noticed too a while ago. Although it does not
directly break crawling, it is not a nice implementation. Note that the
Generator tries to correct for a fetch time that is too far off in the
future (in the shouldFetch method of AbstractFetchSchedule).
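
For illustration, the kind of correction described above can be sketched as follows. This is a simplified, stand-alone sketch and not the actual Nutch code: the class ShouldFetchSketch, the Page holder and the maxIntervalSec parameter are hypothetical names, and the exact clamping rule is an assumption.

```java
// Hypothetical sketch; names and the clamping rule are assumptions,
// not the real AbstractFetchSchedule implementation.
final class ShouldFetchSketch {

    /** Minimal stand-in for a crawl record. */
    static final class Page {
        long fetchTimeMs;  // next scheduled fetch, epoch milliseconds
        long intervalSec;  // fetch interval, seconds

        Page(long fetchTimeMs, long intervalSec) {
            this.fetchTimeMs = fetchTimeMs;
            this.intervalSec = intervalSec;
        }
    }

    /**
     * Returns true if the page is due. A scheduled time further away than
     * maxIntervalSec is pulled back to "now" so the page cannot drift out
     * of reach; an oversized interval is capped as well.
     */
    static boolean shouldFetch(Page page, long nowMs, long maxIntervalSec) {
        if (page.fetchTimeMs - nowMs > maxIntervalSec * 1000L) {
            if (page.intervalSec > maxIntervalSec) {
                page.intervalSec = maxIntervalSec;
            }
            page.fetchTimeMs = nowMs;
        }
        return page.fetchTimeMs <= nowMs;
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        long maxIntervalSec = 90L * 24 * 3600; // assumed 90-day maximum
        Page drifted = new Page(now + 4 * maxIntervalSec * 1000L, 30L * 24 * 3600);
        System.out.println("due: " + shouldFetch(drifted, now, maxIntervalSec));
        System.out.println("reset to now: " + (drifted.fetchTimeMs == now));
    }
}
```

A safeguard of this shape hides the symptom at generate time but leaves the inflated fetch time in the stored record until it is clamped.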

As a matter of fact, I have refactored the updating process slightly so
that the fetch time is updated only once, directly after a fetch. The best
part is that this change allows for running several generate-fetch cycles
without running the updater every time. There is a slight downside, but I
will post it in the issue after I have attached a patch for this
improvement:
https://issues.apache.org/jira/browse/NUTCH-1457
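
To make the reported drift concrete, here is a minimal sketch comparing the two update strategies. It is not taken from the patch: the class FetchTimeDriftSketch and both method names are hypothetical, and the model assumes that every updater pass re-applies the schedule to the same page.

```java
// Hypothetical sketch of the reported behaviour; not code from NUTCH-1457.
final class FetchTimeDriftSketch {

    /**
     * Reported behaviour: every updater pass adds the interval to the fetch
     * time that is already stored, so the value keeps growing.
     */
    static long cumulative(long lastFetchMs, long intervalMs, int updaterPasses) {
        long fetchTime = lastFetchMs;
        for (int i = 0; i < updaterPasses; i++) {
            fetchTime += intervalMs;
        }
        return fetchTime;
    }

    /**
     * Anchored behaviour: the next fetch time is always derived from the
     * actual last fetch, so repeating the update does not change the result.
     */
    static long anchored(long lastFetchMs, long intervalMs, int updaterPasses) {
        long fetchTime = lastFetchMs;
        for (int i = 0; i < updaterPasses; i++) {
            fetchTime = lastFetchMs + intervalMs;
        }
        return fetchTime;
    }

    public static void main(String[] args) {
        long dayMs = 24L * 3600 * 1000;
        long lastFetch = 0L;             // arbitrary reference point
        long interval = 30 * dayMs;      // assumed 30-day interval
        for (int passes = 1; passes <= 3; passes++) {
            System.out.println(passes + " pass(es): cumulative="
                + cumulative(lastFetch, interval, passes) / dayMs + " days, anchored="
                + anchored(lastFetch, interval, passes) / dayMs + " days");
        }
    }
}
```

Under this model the cumulative variant moves the next fetch one interval further away on every pass, while the anchored variant stays at one interval after the last fetch.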

Ferdy.

On Wed, Aug 15, 2012 at 2:11 PM, lin weijian <linweijian8@gmail.com> wrote:

>
> Hi,
> When DbUpdateReducer executes, it calls setFetchSchedule for a
> fetched page. This function adds the fetch interval to the new fetch
> time, regardless of whether it has already been added. This makes the
> fetch time grow larger and larger. It should add the fetch interval to
> the last fetch time instead.
>
>     Thanks.
>
