nutch-dev mailing list archives

From "Julien Nioche (JIRA)" <>
Subject [jira] [Commented] (NUTCH-207) Bandwidth target for fetcher rather than a thread count
Date Thu, 17 Apr 2014 14:35:15 GMT


Julien Nioche commented on NUTCH-207:

I'm starting to think that the cleanest way to implement this would be to make some radical
changes to the way the Fetcher works and use the Executor framework. ThreadPoolExecutor
is quite a nice fit, as it defines a maximum number of threads to use, but it would require
changing the logic in the Fetcher so that the queues push tasks to the Executor instead
of having the FetcherThreads poll them for work. I will probably open a new issue for this.
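
To make the push model concrete, here is a minimal sketch of what handing tasks to a ThreadPoolExecutor could look like. It is not Nutch code: the class FetchExecutorSketch, the FetchTask runnable and the fetchAll helper are all hypothetical names made up for illustration, and the "fetch" only records the URL. Only the java.util.concurrent calls are real.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch, not the Nutch Fetcher: the producer side pushes
// tasks to the pool instead of worker threads polling the queues.
class FetchExecutorSketch {

    // Stand-in for a real fetch; it only records the URL it was given.
    static final class FetchTask implements Runnable {
        private final String url;
        private final Queue<String> fetched;

        FetchTask(String url, Queue<String> fetched) {
            this.url = url;
            this.fetched = fetched;
        }

        @Override
        public void run() {
            fetched.add(url);
        }
    }

    // Submits one task per URL and waits for the pool to drain.
    static List<String> fetchAll(List<String> urls, int maxThreads) {
        Queue<String> fetched = new ConcurrentLinkedQueue<String>();
        // With an unbounded work queue the pool never grows past its core
        // size, so core and maximum are set to the same value here.
        ThreadPoolExecutor executor = new ThreadPoolExecutor(
                maxThreads, maxThreads, 30L, TimeUnit.SECONDS,
                new LinkedBlockingQueue<Runnable>());
        for (String url : urls) {
            executor.execute(new FetchTask(url, fetched)); // push, not poll
        }
        executor.shutdown(); // accept no new tasks, finish the queued ones
        try {
            if (!executor.awaitTermination(1, TimeUnit.MINUTES)) {
                throw new IllegalStateException("fetch tasks did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        return new ArrayList<String>(fetched);
    }

    public static void main(String[] args) {
        List<String> done = fetchAll(
                Arrays.asList("http://a.example/", "http://b.example/", "http://c.example/"), 2);
        System.out.println(done.size() + " URLs processed");
    }
}
```

The pool size can be changed while the executor is running via setCorePoolSize and setMaximumPoolSize, which is presumably what would make this a fit for adjusting the thread count on the fly.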

> Bandwidth target for fetcher rather than a thread count
> -------------------------------------------------------
>                 Key: NUTCH-207
>                 URL:
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher
>    Affects Versions: 0.8
>            Reporter: Rod Taylor
>            Assignee: Julien Nioche
>             Fix For: 1.9
>         Attachments: ratelimit.patch
> Increases or decreases the number of threads from the starting value (fetcher.threads.fetch)
> up to a maximum (fetcher.threads.maximum) to achieve a target bandwidth (fetcher.threads.bandwidth).
> It seems to be able to keep within 10% of the target bandwidth even when large numbers
> of errors are found or when a number of large pages is run across.
> To achieve more accurate tracking Nutch should keep track of protocol overhead as well
> as the volume of pages downloaded.
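
For readers who have not looked at the attachment, the adjustment described in the issue could be sketched roughly as follows. This is an illustration only and not the logic of ratelimit.patch: the class BandwidthThrottleSketch, the nextThreadCount method, the one-thread step, the floor of one thread, the tolerance band and the bytes-per-second unit are all assumptions. The three values passed in correspond to what fetcher.threads.fetch, fetcher.threads.maximum and fetcher.threads.bandwidth would supply.

```java
// Hypothetical sketch of a bandwidth-driven thread count; not the patch itself.
class BandwidthThrottleSketch {

    // Returns the thread count to use for the next measurement interval.
    // Assumptions: step of one thread, floor of one thread, and a symmetric
    // tolerance band around the target (e.g. 0.1 for 10%).
    static int nextThreadCount(int current, int maxThreads,
                               double measuredBytesPerSec,
                               double targetBytesPerSec,
                               double tolerance) {
        double low = targetBytesPerSec * (1.0 - tolerance);
        double high = targetBytesPerSec * (1.0 + tolerance);
        if (measuredBytesPerSec < low && current < maxThreads) {
            return current + 1; // under target: add a thread, capped at the maximum
        }
        if (measuredBytesPerSec > high && current > 1) {
            return current - 1; // over target: drop a thread, never below one
        }
        return current; // inside the band or at a limit: leave unchanged
    }

    public static void main(String[] args) {
        int threads = 10;   // starting value, as fetcher.threads.fetch would give
        int maxThreads = 50; // cap, as fetcher.threads.maximum would give
        double target = 1000000.0; // target, as fetcher.threads.bandwidth would give
        double[] samples = {400000.0, 700000.0, 1300000.0, 1000000.0};
        for (double measured : samples) {
            threads = nextThreadCount(threads, maxThreads, measured, target, 0.1);
            System.out.println("measured=" + measured + " threads=" + threads);
        }
    }
}
```

Counting protocol overhead as well as page bytes, as the issue suggests, would only change how the measured value is computed, not the adjustment step.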

This message was sent by Atlassian JIRA
