nutch-dev mailing list archives

From "Roger Dunk (JIRA)" <j...@apache.org>
Subject [jira] Commented: (NUTCH-721) Fetcher2 Slow
Date Thu, 02 Apr 2009 23:43:13 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12695170#action_12695170 ]

Roger Dunk commented on NUTCH-721:
----------------------------------

For the following tests I used the same segment, containing 5000 URLs. I cleared the named
cache before the first two tests.

[root@server1 trunk]# time bin/nutch org.apache.nutch.fetcher.OldFetcher newcrawl/segments/20090402130655/

real    3m38.084s
user    2m20.887s
sys     0m7.470s

[root@server1 trunk]# time bin/nutch org.apache.nutch.fetcher.Fetcher newcrawl/segments/20090402130655/

[...]

Fetcher: done

real    53m44.800s
user    2m20.070s
sys     0m9.527s

For this next test, I used the same segment but didn't clear the named cache from the previous
test, so all resolvable hosts should still have been cached. This appeared to help greatly:
out of 80 active threads, often only 60 were spinwaiting (as opposed to 79 in the non-cached
test). There were still plenty of times, however, where at least 30 consecutive log entries
showed 80 threads spinwaiting. As the times below show, this is still nowhere near the
league of OldFetcher.

[root@server1 trunk]# time bin/nutch org.apache.nutch.fetcher.Fetcher newcrawl/segments/20090402130655/

[...]

Aborting with 80 hung threads.
Fetcher: done

real    22m5.420s
user    2m39.407s
sys     0m8.192s
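
To put the three wall-clock ("real") times side by side, here is a minimal Python sketch that converts them to seconds and derives URLs per second and the slowdown relative to OldFetcher. It assumes every run attempted all 5000 URLs in the segment; since the last run aborted with hung threads, the URLs/s figures are best read as upper bounds. The helper name `to_seconds` and the run labels are illustrative only.

```python
def to_seconds(stamp: str) -> float:
    """Convert a `time` value such as '3m38.084s' to seconds."""
    minutes, seconds = stamp.rstrip("s").split("m")
    return int(minutes) * 60 + float(seconds)


URLS = 5000  # segment size stated in the comment above

# "real" times copied from the three runs above; labels are illustrative.
runs = {
    "OldFetcher (cold cache)": "3m38.084s",
    "Fetcher (cold cache)": "53m44.800s",
    "Fetcher (warm cache)": "22m5.420s",
}

baseline = to_seconds(runs["OldFetcher (cold cache)"])

for name, stamp in runs.items():
    wall = to_seconds(stamp)
    # Assumes all URLS were attempted in the run (see the note above).
    print(f"{name}: {wall:.1f}s, {URLS / wall:.1f} URLs/s, "
          f"{wall / baseline:.1f}x OldFetcher")
```

On these numbers Fetcher takes roughly 15 times as long as OldFetcher with a cold cache and roughly 6 times as long with a warm one, while user and sys CPU time stay nearly the same across all three runs.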

> Fetcher2 Slow
> -------------
>
>                 Key: NUTCH-721
>                 URL: https://issues.apache.org/jira/browse/NUTCH-721
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 1.0.0
>         Environment: Fedora Core r6, Kernel 2.6.22-14,  jdk1.6.0_12
>            Reporter: Roger Dunk
>         Attachments: crawl_generate.tar.gz, nutch-site.xml
>
>
> Fetcher2 fetches far more slowly than Fetcher1.
> Config options:
> fetcher.threads.fetch = 80
> fetcher.threads.per.host = 80
> fetcher.server.delay = 0
> generate.max.per.host = 1
> With a queue size of ~40,000, the result is:
> activeThreads=80, spinWaiting=79, fetchQueues.totalSize=0
> with maybe a download of 1 page per second.
> Running with -noParse makes little difference.
> CPU load average is around 0.2. With Fetcher1 CPU load is around 2.0 - 3.0
> Hosts already cached by local caching NS appear to download quickly upon a re-fetch,
> so possible issue relating to NS lookups, however all things being equal Fetcher1 runs fast
> without pre-caching hosts.
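
One way to check the NS-lookup suspicion independently of Nutch is to time name resolution for the hosts in the segment, once with a cold cache and once with a warm one. The following is only a sketch in Python, not anything taken from Fetcher or OldFetcher: it resolves hosts one at a time through the system resolver, and the function names `time_lookups` and `summarize` are hypothetical.

```python
import socket
import time


def time_lookups(hosts, resolve=socket.gethostbyname):
    """Return (host, seconds, resolved_ok) for each host, looked up one at a time.

    `resolve` defaults to the system resolver; it is a parameter only so the
    function can be exercised without network access.
    """
    results = []
    for host in hosts:
        start = time.perf_counter()
        try:
            resolve(host)
            ok = True
        except OSError:  # socket.gaierror is a subclass of OSError
            ok = False
        results.append((host, time.perf_counter() - start, ok))
    return results


def summarize(results):
    """Return total lookup time in seconds and the number of failed lookups."""
    total = sum(seconds for _, seconds, _ in results)
    failed = sum(1 for _, _, ok in results if not ok)
    return total, failed
```

Calling `summarize(time_lookups(hosts))` with the host names from the segment, before and after clearing the caching name server, would show how much wall-clock time goes to resolution alone. Because the lookups here are sequential, the total overstates what 80 concurrent threads would wait for, so the comparison between cold and warm runs is more meaningful than the absolute numbers.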

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

