nutch-dev mailing list archives

From "Semyon Semyonov (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (NUTCH-2368) Variable generate.max.count and fetcher.server.delay
Date Tue, 14 Nov 2017 13:22:00 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-2368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16251359#comment-16251359
] 

Semyon Semyonov commented on NUTCH-2368:
----------------------------------------

I found a nasty bug that breaks the feature completely.

The generator still collects the URL when maxCount == 0, because the condition on line 421 is
if (maxCount > 0) instead of >= 0.

I propose adding a check for this condition:
 if (maxCount == 0) {
     continue;
 }
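
To make the effect of the guard concrete, here is a minimal, self-contained sketch. It is not the actual Generator code: the class MaxCountGuardSketch, the Entry holder and the select method are hypothetical names made up for this illustration, and it assumes that a negative maxCount means "no limit".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; none of these names exist in Nutch.
class MaxCountGuardSketch {

  /** Simple (host, url) pair standing in for a generator record. */
  static class Entry {
    final String host;
    final String url;

    Entry(String host, String url) {
      this.host = host;
      this.url = url;
    }
  }

  /** Selects URLs while honouring a per-host maxCount, including maxCount == 0. */
  static List<String> select(List<Entry> entries, Map<String, Integer> maxCountByHost) {
    Map<String, Integer> takenPerHost = new HashMap<>();
    List<String> selected = new ArrayList<>();
    for (Entry e : entries) {
      // Assumption: hosts without an entry get -1, treated here as "no limit".
      int maxCount = maxCountByHost.getOrDefault(e.host, -1);

      // Proposed guard: a limit of 0 means nothing is generated for this host.
      if (maxCount == 0) {
        continue;
      }

      // Existing style of check: only enforce the limit when it is positive.
      if (maxCount > 0) {
        int count = takenPerHost.merge(e.host, 1, Integer::sum);
        if (count > maxCount) {
          continue;
        }
      }
      selected.add(e.url);
    }
    return selected;
  }

  public static void main(String[] args) {
    Map<String, Integer> limits = new HashMap<>();
    limits.put("zero.example", 0);
    limits.put("one.example", 1);
    List<Entry> entries = Arrays.asList(
        new Entry("zero.example", "http://zero.example/a"),
        new Entry("one.example", "http://one.example/a"),
        new Entry("one.example", "http://one.example/b"));
    System.out.println(select(entries, limits));
  }
}
```

Without the maxCount == 0 guard, the zero.example URL would fall through the if (maxCount > 0) block and be selected, which is the behaviour described above.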

Could you check it ASAP?


> Variable generate.max.count and fetcher.server.delay
> ----------------------------------------------------
>
>                 Key: NUTCH-2368
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2368
>             Project: Nutch
>          Issue Type: Improvement
>          Components: generator
>    Affects Versions: 1.12
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>             Fix For: 1.14
>
>         Attachments: NUTCH-2368.patch, NUTCH-2368.patch, NUTCH-2368.patch, NUTCH-2368.patch, NUTCH-2368.patch, NUTCH-2368.patch, NUTCH-2368.patch, NUTCH-2368.patch, NUTCH-2368.patch, NUTCH-2368_RESTAPI_Fix.patch
>
>
> In some cases we need to use host-specific characteristics in determining crawl speed and bulk sizes, because with our (Openindex) settings we can just recrawl hosts with up to 800k URLs.
> This patch solves the problem by introducing the HostDB to the Generator and providing powerful Jexl expressions. Check these two expressions added to the Generator:
> {code}
> -Dgenerate.max.count.expr='
> if (unfetched + fetched > 800000) {
>   return (conf.getInt("fetcher.timelimit.mins", 12) * 60) / ((pct95._rs_ + 500) / 1000) * conf.getInt("fetcher.threads.per.queue", 1)
> } else {
>   return conf.getDouble("generate.max.count", 300);
> }'
> -Dgenerate.fetch.delay.expr='
> if (unfetched + fetched > 800000) {
>   return (pct95._rs_ + 500);
> } else {
>   return conf.getDouble("fetcher.server.delay", 1000)
> }'
> {code}
> For each large host: select as many records as can be fetched, based on the number of threads, the 95th percentile response time and the fetch time limit. Or: queueMaxCount = (timelimit / responsetime) * numThreads.
> The second expression simply follows from that, setting the crawlDelay of the fetch queue.
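
As a rough worked example of the formula in the quoted description, the sketch below redoes the arithmetic of both expressions in plain Java. The class and method names are hypothetical, the 1500 ms 95th percentile response time is an invented sample value, and floating-point arithmetic is assumed throughout (the Jexl expression may evaluate differently depending on operand types).

```java
// Hypothetical illustration of the arithmetic only; not part of Nutch.
class QueueMaxCountSketch {

  /** queueMaxCount = (timelimit / responsetime) * numThreads, with a 500 ms margin. */
  static double queueMaxCount(int timeLimitMins, double pct95ResponseMs, int threadsPerQueue) {
    double timeLimitSecs = timeLimitMins * 60.0;
    // Seconds budgeted per fetch: 95th percentile response time plus 500 ms.
    double secsPerFetch = (pct95ResponseMs + 500.0) / 1000.0;
    return timeLimitSecs / secsPerFetch * threadsPerQueue;
  }

  /** Crawl delay in milliseconds for the same host, following the second expression. */
  static double fetchDelayMs(double pct95ResponseMs) {
    return pct95ResponseMs + 500.0;
  }

  public static void main(String[] args) {
    // Defaults from the expression (12 minutes, 1 thread) and a sample pct95 of 1500 ms.
    System.out.println(queueMaxCount(12, 1500.0, 1)); // prints 360.0
    System.out.println(fetchDelayMs(1500.0));         // prints 2000.0
  }
}
```

With these sample numbers, a large host gets 360 URLs per segment and a 2000 ms delay between fetches, so the queue can be worked through within the 12-minute time limit.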



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
