nutch-dev mailing list archives

From Markus Jelsma <markus.jel...@openindex.io>
Subject Re: Adaptive generate.max.count
Date Fri, 04 Nov 2011 12:56:04 GMT


On Friday 04 November 2011 13:39:25 Ferdy Galema wrote:
> Hi Markus,
> 
> I was wondering what exactly you mean by dynamic. Is it different per
> fetch cycle but the same for all queues, or do you mean a different value
> for different queues? (For example, when the type is HOST, hostA will have
> a different generate max count than hostB.)

Yes, a different value per queue. I would like to generate more records for
domains/hosts with a large number of URLs, such as big news sites. For small
websites we would want to reduce the number of generated records.

The rationale behind this is that politeness varies between small, medium and
large sites. We can easily fetch 100 URLs from a big news site but not from a
small site.
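
To make the idea concrete, here is a rough sketch in Python (not Nutch's Java,
and not an existing Nutch feature). It assumes we already know the URLs per
host, counts them, and scales the cap with the queue size. The function names,
the divisor and the floor/ceiling values are all hypothetical.

```python
from collections import Counter
from urllib.parse import urlsplit


def count_urls_per_host(urls):
    """Count known URLs per host (corresponds to the HOST queue type)."""
    return Counter(urlsplit(url).hostname for url in urls)


def adaptive_max_count(total_urls, divisor=10, floor=5, ceiling=100):
    """Hypothetical rule: cap = total_urls / divisor, clamped to [floor, ceiling].

    Large sites get a higher per-cycle cap, small sites a lower one.
    """
    return max(floor, min(ceiling, total_urls // divisor))


def adaptive_caps(urls, **kwargs):
    """Map each host to its adaptive cap, based on its known URL count."""
    counts = count_urls_per_host(urls)
    return {host: adaptive_max_count(n, **kwargs) for host, n in counts.items()}
```

With these assumed defaults, a host with 5000 known URLs would be capped at
100 records per cycle and a host with 10 known URLs at 5. The real rule would
need tuning against what each site tolerates.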

Cheers


> 
> Ferdy.
> 
> On 11/04/2011 12:32 AM, Markus Jelsma wrote:
> > Hi,
> > 
> > The generate.max.count setting defines the maximum number of records per
> > queue of a given type. We're looking for an improvement to make this
> > setting dynamic. The new variable would be the total number of records
> > for a queue of that type (ip, host, domain).
> > 
> > How can we adapt the generator for this? The problem is that the
> > generator has no information on the total number of records in the queue
> > that a given URL belongs to.
> > 
> > Any thoughts? Could we perhaps modify the updater to count the number of
> > records for a queue and write it to the CrawlDatum, rather than building
> > a new updater tool based on the information provided by the current
> > domainstatistics tool?
> > 
> > Thanks

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
