nutch-dev mailing list archives

From "Sebastian Nagel (JIRA)" <>
Subject [jira] [Commented] (NUTCH-2328) GeneratorJob does not generate anything on second run
Date Tue, 18 Oct 2016 21:17:59 GMT


Sebastian Nagel commented on NUTCH-2328:

> the only solution is to have a cluster wide propagated count 

No, this is not required. The solution with an instance variable is by design:
- local, per-reducer limit = topN / number of reducers
- every reducer checks only for the local limit
- in sum, there will be topN URLs generated

The precondition is that URLs are evenly distributed across different hosts (at least as many
hosts as there are reducers), cf. [[1|]].
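The scheme above can be sketched in a few lines of plain Java (an illustrative simulation, not the actual Nutch code; names like runReducer and the URL strings are hypothetical): each reducer checks only its own instance-scoped counter against topN / numReducers, and the emitted totals sum to topN.

```java
import java.util.ArrayList;
import java.util.List;

public class LocalLimitSketch {

    // Stand-in for one reducer task: the counter is local to the call
    // (instance-scoped in the real reducer), and emission stops once the
    // per-reducer share of topN is reached.
    static List<String> runReducer(List<String> urls, long localLimit) {
        long count = 0;
        List<String> emitted = new ArrayList<>();
        for (String url : urls) {
            if (count >= localLimit) break;
            emitted.add(url);
            count++;
        }
        return emitted;
    }

    public static void main(String[] args) {
        long topN = 10;
        int numReducers = 5;
        long localLimit = topN / numReducers; // 2 URLs per reducer

        long total = 0;
        for (int r = 0; r < numReducers; r++) {
            // Each reducer receives URLs from its own set of hosts.
            List<String> input = new ArrayList<>();
            for (int i = 0; i < 4; i++) {
                input.add("http://host" + r + ".example/" + i);
            }
            total += runReducer(input, localLimit).size();
        }
        System.out.println(total); // 5 reducers * 2 = 10 = topN
    }
}
```

No coordination between reducers is needed; the only requirement is that every reducer receives at least localLimit candidate URLs.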

A job-wide counter does not guarantee the limit any better, because there is no control over
how reduce tasks are launched in time. An even distribution across reducers/parts would only
be achieved if all tasks ran in parallel, at similar speed, and without failures, but that
will hardly happen in a production Hadoop cluster. In a realistic scenario some tasks are
launched first and get more URLs, while tasks launched later get fewer or even none. However,
to achieve optimal utilization of the fetcher, all parts should be of equal size.
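The skew can be simulated in plain Java (again a hypothetical sketch, not Nutch code; running the tasks sequentially stands in for staggered task launch): all tasks draw from one shared topN budget, so the earliest task fills its fetch list while later tasks get little or nothing.

```java
import java.util.concurrent.atomic.AtomicLong;

public class SharedCounterSkew {

    // Stand-in for one reduce task emitting against a single job-wide
    // counter capped at topN: whoever runs first consumes the budget.
    static long runTask(AtomicLong shared, long topN, int inputSize) {
        long emitted = 0;
        for (int i = 0; i < inputSize; i++) {
            if (shared.incrementAndGet() > topN) break;
            emitted++;
        }
        return emitted;
    }

    public static void main(String[] args) {
        long topN = 10;
        AtomicLong shared = new AtomicLong(0);

        // Simulate staggered launch: four tasks run one after another,
        // each with 6 candidate URLs.
        long[] parts = new long[4];
        for (int t = 0; t < 4; t++) {
            parts[t] = runTask(shared, topN, 6);
        }
        // Parts come out as 6, 4, 0, 0: the topN total holds, but the
        // fetch lists are of very unequal size.
        for (long p : parts) System.out.println(p);
    }
}
```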

> GeneratorJob does not generate anything on second run
> -----------------------------------------------------
>                 Key: NUTCH-2328
>                 URL:
>             Project: Nutch
>          Issue Type: Bug
>          Components: generator
>    Affects Versions: 2.2, 2.3, 2.2.1, 2.3.1
>         Environment: Ubuntu 16.04 / Hadoop 2.7.1
>            Reporter: Arthur B
>              Labels: fails, generator, subsequent
>             Fix For: 2.4
>         Attachments: generator-issue-static-count.patch
>   Original Estimate: 24h
>  Remaining Estimate: 24h
> Given a topN parameter (e.g. 10), the GeneratorJob fails to generate anything new on
> subsequent runs within the same process space.
> To reproduce the issue, submit the GeneratorJob twice in a row to the M/R framework. The
> second run will report that it generated 0 URLs.
> This issue is due to the use of the static count field (org.apache.nutch.crawl.GeneratorReducer#count)
> to determine whether the topN value has been reached.
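The reported behavior can be reduced to a minimal sketch (plain Java, not the actual GeneratorReducer; runJob is a hypothetical stand-in for one GeneratorJob run inside the same JVM): because the counter is static, it survives the first run, so the second run sees the limit as already reached and emits nothing.

```java
public class StaticCountBug {

    // Static, so the value persists across runs within the same JVM,
    // which is the essence of the reported bug.
    static long staticCount = 0;

    // Stand-in for one generator run with a topN limit over some number
    // of candidate URLs.
    static long runJob(long topN, int candidates) {
        long emitted = 0;
        for (int i = 0; i < candidates; i++) {
            if (staticCount >= topN) break; // limit check against the static field
            staticCount++;
            emitted++;
        }
        return emitted;
    }

    public static void main(String[] args) {
        long first = runJob(10, 20);  // first run: generates 10 URLs
        long second = runJob(10, 20); // second run: staticCount is still 10, so 0 URLs
        System.out.println(first + " " + second); // prints "10 0"
    }
}
```

Making the counter an instance field (re-initialized per reducer task, as each run creates fresh reducer instances) avoids the carry-over while preserving the per-reducer local limit described above.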

This message was sent by Atlassian JIRA
