nutch-dev mailing list archives

From "Andrzej Bialecki (JIRA)" <>
Subject [jira] Commented: (NUTCH-762) Alternative Generator which can generate several segments in one parse of the crawlDB
Date Tue, 16 Mar 2010 21:30:27 GMT


Andrzej Bialecki  commented on NUTCH-762:

It appears this class is not a strict superset of the existing Generator: the
generate.update.crawldb functionality is not there. This is a regression in useful
functionality, so I think it needs to be added.

> Alternative Generator which can generate several segments in one parse of the crawlDB
> -------------------------------------------------------------------------------------
>                 Key: NUTCH-762
>                 URL:
>             Project: Nutch
>          Issue Type: New Feature
>          Components: generator
>    Affects Versions: 1.0.0
>            Reporter: Julien Nioche
>            Assignee: Julien Nioche
>         Attachments: NUTCH-762-v2.patch
> When using Nutch on a large scale (e.g. billions of URLs), the operations related to
> the crawlDB (generate - update) tend to take up most of the time. One solution is
> to keep such operations to a minimum by generating several fetchlists in one parse of the
> crawlDB and then updating the DB only once for several segments. The existing Generator
> allows several successive runs by generating a copy of the crawlDB and marking the URLs
> to be fetched. In practice this approach does not work well, as we need to read the whole
> crawlDB as many times as we generate segments.
> The patch attached contains an implementation of a MultiGenerator which can generate
> several fetchlists by reading the crawlDB only once. The MultiGenerator differs from the
> Generator in other aspects:
> * can filter the URLs by score
> * normalisation is optional
> * IP resolution is done ONLY on the entries which have been selected for fetching (during
>   the partitioning). Running the IP resolution on the whole crawlDB is too slow to be usable
>   on a large scale
> * can cap the number of URLs per host or domain (but not by IP)
> * can choose to partition by host, domain or IP
> Typically the same unit (e.g. domain) would be used for capping the URLs and for partitioning;
> however, as we can't count the max number of URLs by IP, another unit must be chosen for the
> cap when partitioning by IP.
> We found that using a filter on the score can dramatically improve the performance, as
> this reduces the amount of data being sent to the reducers.
> The MultiGenerator is called via: nutch org.apache.nutch.crawl.MultiGenerator ...
> with the following options:
> MultiGenerator <crawldb> <segments_dir> [-force] [-topN N] [-numFetchers numFetchers] [-adddays numDays] [-noFilter] [-noNorm] [-maxNumSegments num]
> where most parameters are similar to the default Generator, apart from:
> -noNorm (explicit)
> -topN : max number of URLs per segment
> -maxNumSegments : the actual number of segments generated could be less than the max
> value selected, e.g. when not enough URLs are available for fetching and they fit in
> fewer segments
> Please give it a try and let me know what you think of it.
> Julien Nioche
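
The selection logic described in the quoted text (score filter, a per-host cap, and several fetchlists from a single read of the crawlDB) can be sketched roughly as follows. This is not the code from the patch, which is a Hadoop MapReduce job; it is a minimal single-process illustration in Python. The function name, its parameters, and the placement rule (put each URL in the first segment with room) are all assumptions made for the example.

```python
from collections import defaultdict
from urllib.parse import urlparse


def generate_segments(entries, top_n, max_num_segments,
                      min_score=0.0, max_per_host=None):
    """Split (url, score) pairs into at most max_num_segments fetchlists.

    Hypothetical helper, not Nutch API. The input is read only once.
    """
    # Score filter: drop low-scoring URLs early, then favour the best ones.
    candidates = sorted(
        (e for e in entries if e[1] >= min_score),
        key=lambda e: -e[1],
    )

    segments = [[] for _ in range(max_num_segments)]
    host_counts = [defaultdict(int) for _ in range(max_num_segments)]

    for url, _score in candidates:
        host = urlparse(url).hostname
        # Assumed rule: first segment that has room and is under the host cap.
        for segment, counts in zip(segments, host_counts):
            if len(segment) >= top_n:
                continue  # segment already holds top_n URLs
            if max_per_host is not None and counts[host] >= max_per_host:
                continue  # host cap reached in this segment
            segment.append(url)
            counts[host] += 1
            break
        # A URL that fits in no segment is left for a later generate run.

    # Fewer segments than the maximum are returned when URLs run out.
    return [segment for segment in segments if segment]


if __name__ == "__main__":
    sample = [
        ("http://a.example/1", 0.9),
        ("http://a.example/2", 0.8),
        ("http://b.example/1", 0.7),
        ("http://a.example/3", 0.6),
        ("http://c.example/1", 0.1),
    ]
    for fetchlist in generate_segments(sample, top_n=2, max_num_segments=3,
                                       min_score=0.5, max_per_host=1):
        print(fetchlist)
```

In this sketch top_n and max_num_segments play the roles of -topN and -maxNumSegments: each segment holds at most top_n URLs, and empty segments are dropped, so the number returned can be lower than the maximum requested.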

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
