nutch-dev mailing list archives

From "AJ Chen" <cano...@gmail.com>
Subject Re: [jira] Commented: (NUTCH-395) Increase fetching speed
Date Wed, 22 Nov 2006 23:14:58 GMT
Linux box, Opteron 2 GHz, 2 GB RAM, DSL download bandwidth up to 5 Mbps.

This is a new crawldb, crawling on 4000 selected sites, total ~1 million
pages fetched after last run.

use default regex-urlfilter.txt except for:
-(?i)\.(ai|asf|au|avi|bz2|bin|bmp|c|class|css|dmg|doc|dot|dvi|eps|exe|gif|gz|h|hqx|ico|iso|jar|java|jnlp|jpeg|jpg|lha|md5|mov|mp3|mp4|mpg|msi|ogg|png|pps|ppt|ps|psd|ram|ris|rm|rpm|rss|rtf|sit|swf|tar|tbz|tbz2|tgz|tif|wav|wmf|wmv|xls|z|zip)\)?$
-[*!@#]
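
For readers who want to check what these two exclusion rules reject, here is a minimal Python sketch of first-match-wins rule evaluation in the style of regex-urlfilter.txt. It is not Nutch code: the extension list is abbreviated, the trailing "+." catch-all is an assumption about the rest of the default file, and the function names are made up for illustration.

```python
import re

# Rules in regex-urlfilter.txt style: a sign ("+" include, "-" exclude)
# followed by a regular expression. The extension list is abbreviated,
# and the final "+." catch-all is an ASSUMPTION about the default file.
RULES_TEXT = r"""
-(?i)\.(gif|jpg|png|zip|exe|css)\)?$
-[*!@#]
+.
"""


def load_rules(text):
    """Parse rule lines into (include, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        rules.append((line[0] == "+", re.compile(line[1:])))
    return rules


def accepts(rules, url):
    """Return the sign of the first rule whose pattern is found in the URL.

    A URL that matches no rule is rejected (assumed behaviour).
    """
    for include, pattern in rules:
        if pattern.search(url):
            return include
    return False


RULES = load_rules(RULES_TEXT)

if __name__ == "__main__":
    for url in ("http://example.com/index.html",
                "http://example.com/logo.JPG",
                "http://example.com/page#top"):
        print(url, accepts(RULES, url))
```

Because the rules are tried in order, a URL ending in a listed extension is dropped before the catch-all is ever reached.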

additional filter to limit urls to the selected domains  (hashtable
implementation)
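
A hash-based domain filter of this kind could look roughly like the Python sketch below. The real filter is a Nutch plugin and is not shown in this thread, so the domain names, the function name, and the parent-domain matching are all hypothetical.

```python
from urllib.parse import urlparse

# Hypothetical set of selected domains; a set gives hash-based lookup.
ALLOWED_DOMAINS = {"example.org", "example.net"}


def in_selected_domains(url, allowed=ALLOWED_DOMAINS):
    """Return True if the URL's host, or any parent domain of it, is allowed."""
    host = (urlparse(url).hostname or "").lower()
    labels = host.split(".")
    # Check "www.example.org", then "example.org", then "org".
    return any(".".join(labels[i:]) in allowed for i in range(len(labels)))


if __name__ == "__main__":
    print(in_selected_domains("http://www.example.org/a/b.html"))
    print(in_selected_domains("http://other.example.com/"))
```

Each check costs a handful of set lookups per URL, so a filter like this should stay cheap even with thousands of selected sites.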

plugins:
protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic

use default org.apache.nutch.net.URLNormalizer

thanks for helping,
AJ


parse only html and text
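
The phase durations in the quoted log can be recomputed from the timestamps alone. The Python sketch below does this for the first cycle (topN=200000). The phase names follow the annotations in the log, and the last LocalJobRunner progress line is taken as the end of fetching, which is an approximation.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S,%f"  # log4j-style timestamp with milliseconds

# Timestamps copied from the first crawl cycle in the quoted log.
EVENTS = [
    ("generate start", "2006-11-17 17:25:27,367"),
    ("fetch start", "2006-11-17 17:47:45,837"),
    ("fetch end", "2006-11-18 03:13:31,992"),  # last progress line (approximation)
    ("update start", "2006-11-18 07:30:38,085"),
    ("update done", "2006-11-18 11:17:54,000"),
]
PAGES = 183644  # pages reported by the last progress line


def phase_seconds(events):
    """Return the seconds elapsed between consecutive events, keyed by phase."""
    times = [datetime.strptime(ts, FMT) for _, ts in events]
    names = ["generate", "fetch", "reduce", "update"]
    return {n: (b - a).total_seconds()
            for n, a, b in zip(names, times, times[1:])}


PHASES = phase_seconds(EVENTS)

if __name__ == "__main__":
    for name, secs in PHASES.items():
        print(f"{name:8s} {secs / 3600:5.2f} h")
    print(f"fetch rate  {PAGES / PHASES['fetch']:.1f} pages/s")
    print(f"whole cycle {PAGES / sum(PHASES.values()):.1f} pages/s")
```

Dividing the page count by the whole cycle time rather than by the fetch time alone shows how much the reduce and update phases lower the effective rate.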

On 11/22/06, Sami Siren <ssiren@gmail.com> wrote:
>
> What kind of hardware are you running on? Your pages per sec ratio seems
> very low to me.
>
> How big was your crawldb when you started and how big was it at end?
>
> What kind of filters and normalizers are you using?
>
> --
>   Sami Siren
>
> AJ Chen wrote:
> > I checked out the code from trunk after Sami committed the change. I started
> > a new crawl db and ran several crawl cycles sequentially on one Linux
> > server. See below for the real numbers from my test. The performance is
> > still poor because the crawler still spends too much time in reduce and
> > update operations.
> >
> > #crawl cycle: topN=200000
> > 2006-11-17 17:25:27,367 INFO  crawl.Generator - Generator: segment:
> > crawl/segments/20061117172527
> > 2006-11-17 17:47:45,837 INFO  fetcher.Fetcher - Fetcher: segment:
> > crawl/segments/20061117172527
> > # 8 hours fetching ~200000 pages
> > 2006-11-18 03:13:31,992 INFO  mapred.LocalJobRunner - 183644 pages, 5506
> > errors, 5.4 pages/s, 1043 kb/s,
> > # 4 hours doing "reduce"
> > 2006-11-18 07:30:38,085 INFO  crawl.CrawlDb - CrawlDb update: starting
> > # 4 hours update db
> > 2006-11-18 11:17:54,000 INFO  crawl.CrawlDb - CrawlDb update: done
> >
> > #crawl cycle: topN=500,000 pages
> > 2006-11-18 13:22:51,530 INFO  crawl.Generator - Generator: segment:
> > crawl/segments/20061118132251
> > 2006-11-18 14:50:07,006 INFO  fetcher.Fetcher - Fetcher: segment:
> > crawl/segments/20061118132251
> > # fetching for 16 hours
> > 2006-11-19 06:53:34,923 INFO  mapred.LocalJobRunner - 394343 pages, 19050
> > errors, 6.8 pages/s, 1439 kb/s,
> > # reduce for 11 hours
> > 2006-11-19 17:49:15,778 INFO  crawl.CrawlDb - CrawlDb update: segment:
> > crawl/segments/20061118132251
> > # update db for 10 hours
> > 2006-11-20 03:55:22,882 INFO  crawl.CrawlDb - CrawlDb update: done
> >
> > #crawl cycle: topN=600,000 pages
> > 2006-11-20 08:14:51,463 INFO  crawl.Generator - Generator: segment:
> > crawl/segments/20061120081451
> > 2006-11-20 11:31:22,384 INFO  fetcher.Fetcher - Fetcher: segment:
> > crawl/segments/20061120081451
> > #fetching for 18 hours
> > 2006-11-21 06:00:08,504 INFO  mapred.LocalJobRunner - 410078 pages, 26316
> > errors, 6.2 pages/s, 1257 kb/s,
> > #reduce for 11 hours
> > 2006-11-21 17:26:38,213 INFO  crawl.CrawlDb - CrawlDb update: starting
> > #update for 13 hours
> > 2006-11-22 06:25:48,592 INFO  crawl.CrawlDb - CrawlDb update: done
> >
> >
> > -AJ
> >
> >
> > On 11/13/06, Andrzej Bialecki (JIRA) <jira@apache.org> wrote:
> >>
> >>     [ http://issues.apache.org/jira/browse/NUTCH-395?page=comments#action_12449292 ]
> >>
> >>
> >> Andrzej Bialecki  commented on NUTCH-395:
> >> -----------------------------------------
> >>
> >> +1 - this patch looks good to me - if you could just fix the whitespace
> >> issues prior to committing, so that it conforms to the coding style ...
> >>
> >> > Increase fetching speed
> >> > -----------------------
> >> >
> >> >                 Key: NUTCH-395
> >> >                 URL: http://issues.apache.org/jira/browse/NUTCH-395
> >> >             Project: Nutch
> >> >          Issue Type: Improvement
> >> >          Components: fetcher
> >> >    Affects Versions: 0.9.0, 0.8.1
> >> >            Reporter: Sami Siren
> >> >         Assigned To: Sami Siren
> >> >         Attachments: nutch-0.8-performance.txt,
> >> >                      NUTCH-395-trunk-metadata-only-2.patch,
> >> >                      NUTCH-395-trunk-metadata-only.patch
> >> >
> >> >
> >> > There has been some discussion on the nutch mailing lists about the fetcher
> >> > being slow; this patch tries to address that. The patch is just a quick hack
> >> > and needs some cleaning up. It also currently applies to the 0.8 branch and
> >> > not trunk, and it has not been tested at large scale. What it changes:
> >> > Metadata - the original metadata uses spellchecking, the new version does
> >> > not (a decorator is provided that can do it, and it should perhaps be used
> >> > where http headers are handled, but in most cases the functionality is
> >> > not required)
> >> > Reading/writing various data structures - the patch tries to do io more
> >> > efficiently; see the patch for details.
> >> > Initial benchmark:
> >> > A small benchmark was done to measure the performance of the changes with a
> >> > script that basically does the following:
> >> > -inject a list of urls into a fresh crawldb
> >> > -create fetchlist (10k urls pointing to local filesystem)
> >> > -fetch
> >> > -updatedb
> >> > original code from 0.8-branch:
> >> > real    10m51.907s
> >> > user    10m9.914s
> >> > sys     0m21.285s
> >> > after applying the patch
> >> > real    4m15.313s
> >> > user    3m42.598s
> >> > sys     0m18.485s
> >>
> >> --
> >> This message is automatically generated by JIRA.
> >> -
> >> If you think it was sent incorrectly contact one of the administrators:
> >> http://issues.apache.org/jira/secure/Administrators.jspa
> >> -
> >> For more information on JIRA, see: http://www.atlassian.com/software/jira
> >>
> >>
> >>
> >
> >
>
>


-- 
AJ Chen, PhD
Palo Alto, CA
http://web2express.org
