nutch-dev mailing list archives

From "AJ Chen" <cano...@gmail.com>
Subject Re: [jira] Commented: (NUTCH-395) Increase fetching speed
Date Wed, 22 Nov 2006 17:09:12 GMT
I checked out the code from trunk after Sami committed the change. I started
a new crawl db and ran several crawl cycles sequentially on one Linux
server. See below for the real numbers from my test. The performance is
still poor because the crawler still spends too much time in the reduce and
update operations.

#crawl cycle: topN=200000
2006-11-17 17:25:27,367 INFO  crawl.Generator - Generator: segment:
crawl/segments/20061117172527
2006-11-17 17:47:45,837 INFO  fetcher.Fetcher - Fetcher: segment:
crawl/segments/20061117172527
# 8 hours fetching ~200000 pages
2006-11-18 03:13:31,992 INFO  mapred.LocalJobRunner - 183644 pages, 5506
errors, 5.4 pages/s, 1043 kb/s,
# 4 hours doing "reduce"
2006-11-18 07:30:38,085 INFO  crawl.CrawlDb - CrawlDb update: starting
# 4 hours update db
2006-11-18 11:17:54,000 INFO  crawl.CrawlDb - CrawlDb update: done

#crawl cycle: topN=500,000 pages
2006-11-18 13:22:51,530 INFO  crawl.Generator - Generator: segment:
crawl/segments/20061118132251
2006-11-18 14:50:07,006 INFO  fetcher.Fetcher - Fetcher: segment:
crawl/segments/20061118132251
# fetching for 16 hours
2006-11-19 06:53:34,923 INFO  mapred.LocalJobRunner - 394343 pages, 19050
errors, 6.8 pages/s, 1439 kb/s,
# reduce for 11 hours
2006-11-19 17:49:15,778 INFO  crawl.CrawlDb - CrawlDb update: segment:
crawl/segments/20061118132251
# update db for 10 hours
2006-11-20 03:55:22,882 INFO  crawl.CrawlDb - CrawlDb update: done

#crawl cycle: topN=600,000 pages
2006-11-20 08:14:51,463 INFO  crawl.Generator - Generator: segment:
crawl/segments/20061120081451
2006-11-20 11:31:22,384 INFO  fetcher.Fetcher - Fetcher: segment:
crawl/segments/20061120081451
#fetching for 18 hours
2006-11-21 06:00:08,504 INFO  mapred.LocalJobRunner - 410078 pages, 26316
errors, 6.2 pages/s, 1257 kb/s,
#reduce for 11 hours
2006-11-21 17:26:38,213 INFO  crawl.CrawlDb - CrawlDb update: starting
#update for 13 hours
2006-11-22 06:25:48,592 INFO  crawl.CrawlDb - CrawlDb update: done
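
For anyone who wants to check the per-phase times against the timestamps, here is a minimal Python sketch. It assumes that the five log lines quoted for each cycle mark the phase boundaries (Generator segment, Fetcher segment, last fetch status line, CrawlDb update start or segment line, CrawlDb update done). The names CYCLES, PHASES and phase_hours are made up for this illustration and are not part of Nutch.

```python
from datetime import datetime

# log4j-style timestamp: date, time, comma, milliseconds
LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S,%f"

# Timestamps copied from the log excerpts in this message, in order:
# Generator segment, Fetcher segment, last fetch status line,
# CrawlDb update start (or segment line), CrawlDb update done.
CYCLES = {
    "topN=200000": [
        "2006-11-17 17:25:27,367",
        "2006-11-17 17:47:45,837",
        "2006-11-18 03:13:31,992",
        "2006-11-18 07:30:38,085",
        "2006-11-18 11:17:54,000",
    ],
    "topN=500,000": [
        "2006-11-18 13:22:51,530",
        "2006-11-18 14:50:07,006",
        "2006-11-19 06:53:34,923",
        "2006-11-19 17:49:15,778",
        "2006-11-20 03:55:22,882",
    ],
    "topN=600,000": [
        "2006-11-20 08:14:51,463",
        "2006-11-20 11:31:22,384",
        "2006-11-21 06:00:08,504",
        "2006-11-21 17:26:38,213",
        "2006-11-22 06:25:48,592",
    ],
}

# Assumption: each phase runs from one quoted log line to the next.
PHASES = ("generate", "fetch", "reduce", "update")


def phase_hours(timestamps):
    """Return a dict mapping phase name to elapsed hours between consecutive timestamps."""
    times = [datetime.strptime(t, LOG_TIME_FORMAT) for t in timestamps]
    return {
        name: (end - start).total_seconds() / 3600.0
        for name, start, end in zip(PHASES, times, times[1:])
    }


if __name__ == "__main__":
    for label, stamps in CYCLES.items():
        hours = phase_hours(stamps)
        summary = ", ".join(f"{name} {h:.1f}h" for name, h in hours.items())
        print(f"{label}: {summary}")
```

The hour counts in the "#" comments are rounded by hand, so the computed values can differ from them somewhat.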


-AJ


On 11/13/06, Andrzej Bialecki (JIRA) <jira@apache.org> wrote:
>
>     [
> http://issues.apache.org/jira/browse/NUTCH-395?page=comments#action_12449292]
>
> Andrzej Bialecki  commented on NUTCH-395:
> -----------------------------------------
>
> +1 - this patch looks good to me - if you could just fix the whitespace
> issues prior to committing, so that it conforms to the coding style ...
>
> > Increase fetching speed
> > -----------------------
> >
> >                 Key: NUTCH-395
> >                 URL: http://issues.apache.org/jira/browse/NUTCH-395
> >             Project: Nutch
> >          Issue Type: Improvement
> >          Components: fetcher
> >    Affects Versions: 0.9.0, 0.8.1
> >            Reporter: Sami Siren
> >         Assigned To: Sami Siren
> >         Attachments: nutch-0.8-performance.txt,
> NUTCH-395-trunk-metadata-only-2.patch, NUTCH-395-trunk-metadata-only.patch
> >
> >
> > There has been some discussion on the nutch mailing lists about the fetcher
> being slow; this patch tries to address that. The patch is just a quick hack
> and needs some cleaning up. It also currently applies to the 0.8 branch and
> not trunk, and it has not been tested at large scale. What does it change?
> > Metadata - the original metadata uses spellchecking, the new version does
> not (a decorator is provided that can do it, and it should perhaps be used
> where http headers are handled, but in most cases the functionality is
> not required)
> > Reading/writing various data structures - the patch tries to do io more
> efficiently; see the patch for details.
> > Initial benchmark:
> > A small benchmark was done to measure the performance of changes with a
> script that basically does the following:
> > -inject a list of urls into a fresh crawldb
> > -create fetchlist (10k urls pointing to local filesystem)
> > -fetch
> > -updatedb
> > original code from 0.8-branch:
> > real    10m51.907s
> > user    10m9.914s
> > sys     0m21.285s
> > after applying the patch
> > real    4m15.313s
> > user    3m42.598s
> > sys     0m18.485s
>
> --
> This message is automatically generated by JIRA.
> -
> If you think it was sent incorrectly contact one of the administrators:
> http://issues.apache.org/jira/secure/Administrators.jspa
> -
> For more information on JIRA, see: http://www.atlassian.com/software/jira
>
>
>


-- 
AJ Chen, PhD
Palo Alto, CA
http://web2express.org
