nutch-dev mailing list archives

From "behnam nikbakht (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (NUTCH-1067) Configure minimum throughput for fetcher
Date Tue, 06 Mar 2012 11:00:58 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-1067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13223157#comment-13223157 ]

behnam nikbakht commented on NUTCH-1067:
----------------------------------------

I cannot understand why the threshold checker is disabled with:
throughputThresholdPages = -1;
That causes this threshold to be enforced only once.
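
To make the concern concrete, here is a minimal sketch of the behaviour as this comment reads it. It is not the code from the patch: the class OneShotThreshold and the names check, pagesLastSec and triggerCount are hypothetical, and only throughputThresholdPages is taken from the line quoted above. It assumes that -1 is treated as "checker disabled".

```java
// Hypothetical sketch, not the Nutch source. Only the field name
// throughputThresholdPages comes from the comment above.
class OneShotThreshold {
    private int throughputThresholdPages;
    private int triggerCount = 0;

    OneShotThreshold(int throughputThresholdPages) {
        this.throughputThresholdPages = throughputThresholdPages;
    }

    /** Returns true if this call found throughput below the threshold. */
    boolean check(int pagesLastSec) {
        // Assumption: -1 means the checker is switched off.
        if (throughputThresholdPages != -1
                && pagesLastSec < throughputThresholdPages) {
            triggerCount++;
            // The assignment in question: every later call skips the check.
            throughputThresholdPages = -1;
            return true;
        }
        return false;
    }

    int triggerCount() {
        return triggerCount;
    }
}
```

With a threshold of 5, a call with 2 pages triggers the check, and a later call with 1 page does not, because the threshold has already been reset to -1.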
                
> Configure minimum throughput for fetcher
> ----------------------------------------
>
>                 Key: NUTCH-1067
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1067
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>            Priority: Minor
>             Fix For: 1.4
>
>         Attachments: NUTCH-1045-1.4-v2.patch, NUTCH-1067-1.4-1.patch, NUTCH-1067-1.4-2.patch,
>                      NUTCH-1067-1.4-3.patch, NUTCH-1067-1.4-4.patch
>
>
> Large fetches can contain a lot of URLs for the same domain. These can be very slow
> to crawl due to politeness from robots.txt, e.g. 10s per URL. If all other URLs have been
> fetched, these queues can stall the entire fetcher; 60 URLs can then take 10 minutes or
> even more. This can usually be dealt with using the time bomb, but the time bomb value is hard
> to determine.
> This patch adds a fetcher.throughput.threshold setting, meaning the minimum number of
> pages per second before the fetcher gives up. It doesn't use the global number of pages /
> running time but records the actual pages processed in the previous second. This value is
> compared with the configured threshold.
> Besides the check, the fetcher's status is also updated with the actual number of pages
> per second and bytes per second.
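
The difference between a global average and a per-second measurement, as described in the issue text above, can be illustrated with a rough sketch. This is not the patch itself: ThroughputMonitor, pageFetched, tick and pagesLastSecond are hypothetical names, and the sketch assumes the caller invokes tick() about once per second.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch, not the Nutch fetcher. It compares the pages
// processed since the previous tick with a configured minimum.
class ThroughputMonitor {
    private final AtomicInteger pages = new AtomicInteger();
    private final int thresholdPagesPerSec;
    private int lastTotal = 0;
    private int pagesLastSec = 0;

    ThroughputMonitor(int thresholdPagesPerSec) {
        this.thresholdPagesPerSec = thresholdPagesPerSec;
    }

    /** Called by fetcher threads for every processed page. */
    void pageFetched() {
        pages.incrementAndGet();
    }

    /**
     * Assumed to be called once per second by a single reporting thread.
     * Returns true if the previous second fell below the threshold.
     */
    boolean tick() {
        int total = pages.get();
        pagesLastSec = total - lastTotal; // pages in the last interval only
        lastTotal = total;
        return pagesLastSec < thresholdPagesPerSec;
    }

    int pagesLastSecond() {
        return pagesLastSec;
    }

    public static void main(String[] args) {
        ThroughputMonitor m = new ThroughputMonitor(5);
        for (int i = 0; i < 100; i++) {
            m.pageFetched();
        }
        System.out.println("below threshold: " + m.tick());
        // No pages in the next interval: the global average is still
        // 50 pages/s, but the per-second figure drops to 0.
        System.out.println("below threshold: " + m.tick());
    }
}
```

In this sketch, a fetcher that processed 100 pages in its first second and none in its second still averages 50 pages per second overall, yet the second tick reports a breach because only the last interval is counted.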

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
