nutch-dev mailing list archives

From George Herlin <>
Subject Re: Infinite loop bug in Nutch 0.9
Date Wed, 01 Apr 2009 10:29:56 GMT
Sorry, forgot to say, there is an added precondition to causing the bug:

The redirection has to be fetched before the page it redirects to. If not,
there will be a pre-existing crawl datum with a reasonable

2009/4/1 George Herlin <>

> Hello, there.
> I believe I may have found an infinite loop in Nutch 0.9.
> It happens when a site has a page that refers to itself through a
> redirection.
> The code in, around line 200 - sorry, my Fetcher has been a
> little modified, line numbers may vary a little - says, for that case:
> output(url, new CrawlDatum(), null, null, CrawlDatum.STATUS_LINKED);
> What that does is insert an extra (empty) crawl datum for the new url,
> with a re-fetch interval of 0.0.
> However, (see, particularly lines 144-145), the
> non-refetch condition used seems to be last-fetch+refetch-interval>now ...
> which is always false if refetch-interval==0.0!
> Now, if there is a new link to the new url in that page, that crawl datum
> is re-used, and the whole thing loops indefinitely.
> I've fixed that for myself by replacing the quoted line (in both places) with:
> output(url, new CrawlDatum(CrawlDatum.STATUS_LINKED, 30f), null, null,
> CrawlDatum.STATUS_LINKED);
> and that works. (By the way, the 30f should really be the value of
> "db.default.fetch.interval", but I haven't the time right now to work out
> the issues.) In reality, the default constructor and the appropriate
> updater method should, if I am right in analysing the algorithm, always
> enforce a positive refetch interval.
> Of course, another method could be used to remove this self-reference, but
> that could be complicated, as the self-reference may happen through a loop
> of 2 or more pages.
> Has that been fixed already, and by what method?
> Best regards
> George Herlin
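
To illustrate the condition described in the quoted message, here is a minimal Java sketch. It is not Nutch code: the class RefetchIntervalSketch, the method shouldSkip, and the millisecond/day units are assumptions made up for the example. It only models the check "last-fetch + refetch-interval > now" as quoted, to show why an interval of 0.0 can never cause a url to be skipped, while a positive interval such as the 30f used in the fix can.

```java
// Hypothetical stand-in for the check discussed in the message.
// Names and units are illustrative assumptions, not Nutch's actual API.
class RefetchIntervalSketch {

    static final long DAY_MS = 24L * 60 * 60 * 1000;

    // The "non-refetch" condition as quoted: last-fetch + refetch-interval > now.
    // Returns true when the url should be skipped (not fetched again yet).
    static boolean shouldSkip(long lastFetchMs, float intervalDays, long nowMs) {
        long intervalMs = (long) (intervalDays * DAY_MS);
        return lastFetchMs + intervalMs > nowMs;
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();

        // Interval 0.0: even a url fetched this very instant is not skipped,
        // so it becomes eligible for fetching again immediately.
        System.out.println(shouldSkip(now, 0.0f, now));   // prints false

        // Positive interval (30 days, as in the fix): the url is skipped.
        System.out.println(shouldSkip(now, 30.0f, now));  // prints true
    }
}
```

With a zero interval the left-hand side equals the last fetch time, which is never later than now, so the check fails on every pass. That matches the loop described in the message.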
