nutch-dev mailing list archives

From "Hudson (JIRA)" <>
Subject [jira] [Commented] (NUTCH-2652) Fetcher launches more fetch tasks than fetch lists
Date Thu, 15 Nov 2018 10:45:01 GMT


Hudson commented on NUTCH-2652:

FAILURE: Integrated in Jenkins build Nutch-trunk #3589 (See [])
NUTCH-2652 Fetcher launches more fetch tasks than fetch lists - properly (snagel: [])
* (edit) src/java/org/apache/nutch/fetcher/

> Fetcher launches more fetch tasks than fetch lists
> --------------------------------------------------
>                 Key: NUTCH-2652
>                 URL:
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 1.15
>         Environment: Hadoop, distributed mode (cluster of 22 nodes), CDH 5.15.1, Nutch built on recent master.
> Seen for the first time just now, although Nutch 1.15 has been running for two months. However, the constraints that cause inputs to be split may change from run to run.
>            Reporter: Sebastian Nagel
>            Assignee: Sebastian Nagel
>            Priority: Critical
>             Fix For: 1.16
> Fetcher may launch more fetcher tasks than there are fetch lists:
> {noformat}
> 18/10/15 07:27:26 INFO input.FileInputFormat: Total input paths to process : 128
> 18/10/15 07:27:26 INFO mapreduce.JobSubmitter: number of splits:187
> {noformat}
> This conflicts with a design principle of Nutch as a MapReduce-based crawler: to ensure
> politeness and a guaranteed delay between requests to the same host/domain/IP, the Generator
> puts all items of one host/domain/IP into the same fetch list. A fetch list must not be
> split, because that would violate the politeness constraints: multiple fetcher tasks
> processing the splits of one fetch list could then send requests to the same host/domain/IP
> in parallel. See [~ab]'s chapter about Nutch in [Hadoop: The Definitive Guide (3rd edition)|].
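
To illustrate how 128 input paths can turn into more than 128 splits, here is a minimal, self-contained sketch of size-based split counting. It is a simplified model and not Nutch's or Hadoop's actual code: the class name, the method names, and the file sizes are hypothetical, and the 1.1 slop factor is an assumption modelled on the way Hadoop's FileInputFormat sizes splits. In the Hadoop MapReduce API, an input format can opt out of splitting by overriding isSplitable(JobContext, Path) to return false; the boolean flag below stands in for that decision.

```java
public class FetchListSplitSketch {

    // Assumed slop factor: the final chunk may be up to 10% larger than splitSize.
    static final double SPLIT_SLOP = 1.1;

    /** Number of splits (and therefore tasks) produced for one fetch list file. */
    static int countSplits(long fileLength, long splitSize, boolean splittable) {
        if (!splittable || fileLength == 0) {
            return 1; // the whole fetch list goes to exactly one task
        }
        int splits = 0;
        long remaining = fileLength;
        // Cut off full-size chunks while the remainder is clearly larger than one split.
        while ((double) remaining / splitSize > SPLIT_SLOP) {
            splits++;
            remaining -= splitSize;
        }
        if (remaining != 0) {
            splits++; // the last, possibly smaller or slightly larger, chunk
        }
        return splits;
    }

    /** Total number of splits over all fetch list files. */
    static int totalSplits(long[] fileLengths, long splitSize, boolean splittable) {
        int total = 0;
        for (long length : fileLengths) {
            total += countSplits(length, splitSize, splittable);
        }
        return total;
    }

    public static void main(String[] args) {
        // Hypothetical fetch list sizes and split size, all in MB.
        long[] fetchLists = {100, 150, 300};
        long splitSize = 128;

        // Splittable input: large fetch lists are divided among several tasks.
        System.out.println("splittable:   " + totalSplits(fetchLists, splitSize, true) + " tasks");
        // Unsplittable input: one task per fetch list, as the politeness constraints require.
        System.out.println("unsplittable: " + totalSplits(fetchLists, splitSize, false) + " tasks");
    }
}
```

With these hypothetical sizes the splittable case yields 6 tasks for 3 fetch lists, while the unsplittable case yields exactly 3. Only the second preserves the guarantee that a single task handles all URLs of a given host/domain/IP.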

This message was sent by Atlassian JIRA
