nutch-dev mailing list archives

From "Alexander Kingson (JIRA)" <>
Subject [jira] [Commented] (NUTCH-1922) DbUpdater overwrites fetch status for URLs from previous batches, causes repeated re-fetches
Date Thu, 29 Jan 2015 20:49:36 GMT


Alexander Kingson commented on NUTCH-1922:


My comment is the same for this patch: I believe an unclosed data store can cause memory leaks.

Also, this patch does not solve the issues with inlinks and outlinks; currently they are not handled correctly.
I would suggest investigating the Nutch 1.x code to see how it handles inlinks and outlinks and
transferring that logic to Nutch 2.x.
I will do it when I get some time. In the meantime, if someone investigates and lets us know
how Nutch 1.x works, I would greatly appreciate it.
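To illustrate the unclosed-store concern, here is a minimal sketch of the pattern being argued for. The `DataStore` interface below is a hypothetical stand-in for a Gora-style store (the real Nutch 2.x stores are `org.apache.gora.store.DataStore` instances with the same `close()` contract); it is not the actual Nutch code.

```java
// Hypothetical stand-in for a Gora-style data store used by Nutch 2.x.
interface DataStore extends AutoCloseable {
    void put(String key, String value);
    @Override
    void close(); // must be called to release underlying resources
}

public class StoreUsage {
    // Runs one update inside try-with-resources and reports whether
    // the store's close() was invoked afterwards.
    static boolean runDemo() {
        final boolean[] closed = {false};
        DataStore store = new DataStore() {
            public void put(String key, String value) { /* no-op for this sketch */ }
            public void close() { closed[0] = true; }
        };
        // try-with-resources guarantees close() runs even if put() throws,
        // which is exactly the unclosed-store leak the comment warns about.
        try (DataStore s = store) {
            s.put("http://example.com/", "fetched");
        }
        return closed[0];
    }

    public static void main(String[] args) {
        System.out.println("store closed: " + runDemo());
    }
}
```

The same guarantee can be had with an explicit `try { … } finally { store.close(); }` in code that predates try-with-resources.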


> DbUpdater overwrites fetch status for URLs from previous batches, causes repeated re-fetches
> --------------------------------------------------------------------------------------------
>                 Key: NUTCH-1922
>                 URL:
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 2.3
>            Reporter: Gerhard Gossen
>             Fix For: 2.4
>         Attachments: NUTCH-1922.patch
> When Nutch 2 finds a link to a URL that was crawled in a previous batch, it resets the
fetch status of that URL to {{unfetched}}. This makes this URL available for a re-fetch, even
if its crawl interval is not yet over.
> To reproduce, using version 2.3:
> {code}
> # Nutch configuration
> ant runtime
> cd runtime/local
> mkdir seeds
> echo > seeds/1.txt
> bin/crawl seeds test 2
> {code}
> This uses two files {{a.html}} and {{b.html}} that link to each other.
> In batch 1, Nutch downloads {{a.html}} and discovers the URL of {{b.html}}. In batch
2, Nutch downloads {{b.html}} and discovers the link to {{a.html}}. This should update the
score and link fields of {{a.html}}, but not the fetch status. However, when I run {{bin/nutch
readdb -crawlId test -url | grep -a status}}, it returns
{{status: 1 (status_unfetched)}}.
> Expected would be {{status: 2 (status_fetched)}}.
> The reason seems to be that DbUpdateReducer assumes that [links to a URL not processed
in the same batch always belong to new pages|].
Before NUTCH-1556, all pages in the crawl DB were processed by the DbUpdate job, but that
change skipped all pages with a different batch ID, so I assume that this introduced this
bug.
This message was sent by Atlassian JIRA
