nutch-dev mailing list archives

From "Semyon Semyonov (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (NUTCH-2368) Variable generate.max.count and fetcher.server.delay
Date Wed, 08 Nov 2017 16:05:01 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-2368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16244223#comment-16244223 ]

Semyon Semyonov commented on NUTCH-2368:
----------------------------------------

I found a bug: the hostdbReaders streams are not reset for each key of the reducer.

Assume we have four hosts in the hostdb:
A B C D

The first time the reducer runs, for host C, hostdbReaders[i].next stops at C, so the only
record left in the stream is D.
The second time we look for B, but the only record left is D, so the scan reaches the end of
the stream and getHostDatum returns null.
The same happens for all following keys of the reducer: the host datum is null.

private HostDatum getHostDatum(String host) throws Exception {
  Text key = new Text();
  HostDatum value = new HostDatum();

  for (int i = 0; i < hostdbReaders.length; i++) {
    while (hostdbReaders[i].next(key, value)) {
      if (host.equals(key.toString())) {
        return value;
      }
    }
  }
  return null;
}
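
The failure can be reproduced without Hadoop. The sketch below is a simplified stand-in, not
the Nutch code: a plain Iterator plays the role of one hostdb reader, and the class and method
names are hypothetical. It only assumes, as described above, that the reader is a forward-only
cursor whose position survives between calls.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class ForwardOnlyLookupDemo {
    // Stand-in for one hostdb reader: a forward-only cursor over the hosts.
    private final Iterator<String> reader;

    public ForwardOnlyLookupDemo(List<String> hosts) {
        this.reader = hosts.iterator();
    }

    // Same shape as getHostDatum: scan forward from wherever the cursor was left.
    public String lookup(String host) {
        while (reader.hasNext()) {
            String key = reader.next();
            if (host.equals(key)) {
                return key;
            }
        }
        return null; // cursor exhausted, so every later call returns null as well
    }

    public static void main(String[] args) {
        ForwardOnlyLookupDemo db =
            new ForwardOnlyLookupDemo(Arrays.asList("A", "B", "C", "D"));
        System.out.println(db.lookup("C")); // found; only D is left in the cursor
        System.out.println(db.lookup("B")); // null: B was already skipped
        System.out.println(db.lookup("D")); // null: D was consumed while looking for B
    }
}
```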

What do you think is the best way to solve it? Recreate the readers on each call?
    Path path = new Path(job.get(GENERATOR_HOSTDB), "current");
    hostdbReaders = SequenceFileOutputFormat.getReaders(job, path);
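
To make the trade-off concrete, here is a rough sketch of two possible approaches, again
without Hadoop and with hypothetical names throughout. Option 1 mirrors the idea of recreating
the readers for every lookup. Option 2, reading everything once into an in-memory map, is an
alternative I am assuming for comparison; it is not taken from the patch. A LinkedHashMap
stands in for the hostdb files, and an Integer stands in for HostDatum.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Supplier;

public class HostLookupFixes {
    // Hypothetical stand-in for the hostdb: host -> some per-host value.
    public static Map<String, Integer> sampleHostDb() {
        Map<String, Integer> db = new LinkedHashMap<>();
        db.put("A", 10);
        db.put("B", 20);
        db.put("C", 30);
        db.put("D", 40);
        return db;
    }

    // Option 1: open a fresh cursor for every lookup, so each scan starts at the first record.
    public static Integer lookupWithReopen(
            Supplier<Iterator<Map.Entry<String, Integer>>> open, String host) {
        Iterator<Map.Entry<String, Integer>> reader = open.get();
        while (reader.hasNext()) {
            Map.Entry<String, Integer> entry = reader.next();
            if (host.equals(entry.getKey())) {
                return entry.getValue();
            }
        }
        return null;
    }

    // Option 2: drain the cursor once into a map and answer every lookup from memory.
    public static Map<String, Integer> loadAll(Iterator<Map.Entry<String, Integer>> reader) {
        Map<String, Integer> cache = new HashMap<>();
        while (reader.hasNext()) {
            Map.Entry<String, Integer> entry = reader.next();
            cache.put(entry.getKey(), entry.getValue());
        }
        return cache;
    }

    public static void main(String[] args) {
        Map<String, Integer> db = sampleHostDb();
        Map<String, Integer> cache = loadAll(db.entrySet().iterator());
        // Same lookup order as in the bug description: C first, then B, then D.
        for (String host : new String[] {"C", "B", "D"}) {
            Integer reopened = lookupWithReopen(() -> db.entrySet().iterator(), host);
            System.out.println(host + " reopen=" + reopened + " cached=" + cache.get(host));
        }
    }
}
```

Reopening rescans the hostdb from the start for every reducer key, while the map costs memory
proportional to the number of hosts, so which one fits depends on the size of the hostdb.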

> Variable generate.max.count and fetcher.server.delay
> ----------------------------------------------------
>
>                 Key: NUTCH-2368
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2368
>             Project: Nutch
>          Issue Type: Improvement
>          Components: generator
>    Affects Versions: 1.12
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>             Fix For: 1.14
>
>         Attachments: NUTCH-2368.patch, NUTCH-2368.patch, NUTCH-2368.patch, NUTCH-2368.patch,
>                      NUTCH-2368.patch, NUTCH-2368.patch, NUTCH-2368.patch, NUTCH-2368.patch,
>                      NUTCH-2368.patch, NUTCH-2368_RESTAPI_Fix.patch
>
>
> In some cases we need to use host-specific characteristics to determine crawl speed
> and bulk sizes, because with our (Openindex) settings we can only recrawl hosts with up to
> 800k URLs.
> This patch solves the problem by introducing the HostDB to the Generator and providing
> powerful Jexl expressions. Check these two expressions added to the Generator:
> {code}
> -Dgenerate.max.count.expr='
> if (unfetched + fetched > 800000) {
>   return (conf.getInt("fetcher.timelimit.mins", 12) * 60) / ((pct95._rs_ + 500) / 1000) * conf.getInt("fetcher.threads.per.queue", 1)
> } else {
>   return conf.getDouble("generate.max.count", 300);
> }'
> -Dgenerate.fetch.delay.expr='
> if (unfetched + fetched > 800000) {
>   return (pct95._rs_ + 500);
> } else {
>   return conf.getDouble("fetcher.server.delay", 1000)
> }'
> {code}
> For each large host: select as many records as can be fetched, based on the number of
> threads, the 95th percentile response time, and the fetch time limit. Or: queueMaxCount
> = (timelimit / responsetime) * numThreads.
> The second expression just follows up on that, setting the crawlDelay of the fetch queue.
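
As a rough check of the arithmetic in the two expressions, the sketch below redoes the
large-host branch in plain Java with doubles. The method and class names are hypothetical, the
sample 95th percentile of 1500 ms is made up for illustration, and how Jexl itself rounds or
divides integers is not modelled here.

```java
public class QueueMaxCountExample {
    // Large-host branch of generate.max.count.expr:
    // (timelimit in seconds) / (seconds per fetch) * threads per queue.
    public static double queueMaxCount(int timeLimitMins, double pct95ResponseMs,
                                       int threadsPerQueue) {
        double timeLimitSecs = timeLimitMins * 60.0;
        double secsPerFetch = (pct95ResponseMs + 500.0) / 1000.0;
        return timeLimitSecs / secsPerFetch * threadsPerQueue;
    }

    // Large-host branch of generate.fetch.delay.expr, in milliseconds.
    public static double fetchDelayMs(double pct95ResponseMs) {
        return pct95ResponseMs + 500.0;
    }

    public static void main(String[] args) {
        // Defaults from the expressions: 12 minutes, 1 thread per queue.
        // Assumed sample value: 95th percentile response time of 1500 ms.
        System.out.println(queueMaxCount(12, 1500.0, 1)); // 720 s / 2 s * 1 = 360.0
        System.out.println(fetchDelayMs(1500.0));         // 2000.0
    }
}
```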



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
