nutch-dev mailing list archives

From Ferdy Galema <ferdy.gal...@kalooga.com>
Subject Re: Adaptive generate.max.count
Date Fri, 04 Nov 2011 13:31:08 GMT
Using an adaptive setting is a pretty daunting task. Perhaps a nice 
start would be a mechanism that allows exceptional queue settings to 
be set *by hand*? A resource file would fit this purpose. Later on it 
could be replaced by automatic settings.
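
To make the resource file idea concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the class name QueueMaxCountOverrides, the "queue-key max-count" line format and the fallback argument are hypothetical and are not part of Nutch. It only shows how hand-set exceptions could fall back to the global generate.max.count value.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not an existing Nutch class.
class QueueMaxCountOverrides {
  private final Map<String, Integer> overrides = new HashMap<>();
  private final int defaultMaxCount;

  // Reads lines of the assumed form "<queue key> <max count>".
  // Blank lines and lines starting with '#' are ignored.
  QueueMaxCountOverrides(Reader source, int defaultMaxCount) throws IOException {
    this.defaultMaxCount = defaultMaxCount;
    BufferedReader in = new BufferedReader(source);
    String line;
    while ((line = in.readLine()) != null) {
      line = line.trim();
      if (line.isEmpty() || line.startsWith("#")) {
        continue;
      }
      String[] parts = line.split("\\s+");
      if (parts.length != 2) {
        throw new IOException("Malformed override line: " + line);
      }
      // Keys are lower-cased so lookups are case-insensitive.
      overrides.put(parts[0].toLowerCase(), Integer.parseInt(parts[1]));
    }
  }

  // Returns the hand-set limit for this queue (host, domain or ip),
  // or the global default when no exception is listed.
  int maxCountFor(String queueKey) {
    return overrides.getOrDefault(queueKey.toLowerCase(), defaultMaxCount);
  }
}
```

The generator would then ask maxCountFor(key) wherever it currently uses the single global value, so queues without an entry behave exactly as before.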

On 11/04/2011 01:56 PM, Markus Jelsma wrote:
>
> On Friday 04 November 2011 13:39:25 Ferdy Galema wrote:
>> Hi Markus,
>>
>> I was wondering what exactly you mean by dynamic. Is it different per
>> fetch cycle but the same for all queues, or do you mean a different
>> value for different queues? (For example, when type is HOST, hostA will
>> have a different generate max count than hostB.)
> Yes. I would like to generate more records for domains/hosts with a large
> number of URLs, such as big news sites. For small websites we would want to
> reduce the number of generated records.
>
> The rationale behind this is that politeness varies between small, medium and
> large sites. We can easily fetch 100 URLs for the big news site but not for a
> small site.
>
> Cheers
>
>
>> Ferdy.
>>
>> On 11/04/2011 12:32 AM, Markus Jelsma wrote:
>>> Hi,
>>>
>>> The generate.max.count defines the number of records per type of queue.
>>> We're looking for an improvement to make this setting dynamic. The new
>>> variable would be the total number of records for that type of queue
>>> (ip, host, domain).
>>>
>>> How can we adapt the generator for this? The problem is that there's no
>>> information on the number of records for a given URL.
>>>
>>> Any thoughts? Could we perhaps modify the updater to count the number of
>>> records for a queue and write it to the CrawlDatum without building a new
>>> updater tool based on the information provided by the current
>>> domainstatistics tool?
>>>
>>> Thanks
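
As an illustration of the adaptive variant discussed in the quoted messages, the sketch below counts known URLs per host and derives a per-host limit from that count. It is a hypothetical, standalone example: the class AdaptiveMaxCount, the fraction parameter and the min/max bounds are assumptions, and it does not touch the CrawlDatum, the updater or the domainstatistics tool.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not an existing Nutch class.
class AdaptiveMaxCount {

  // Counts how many of the given URLs belong to each host.
  // URLs that cannot be parsed or have no host are skipped.
  static Map<String, Integer> countPerHost(Iterable<String> urls) {
    Map<String, Integer> counts = new HashMap<>();
    for (String url : urls) {
      String host;
      try {
        host = URI.create(url).getHost();
      } catch (IllegalArgumentException e) {
        continue;
      }
      if (host != null) {
        counts.merge(host.toLowerCase(), 1, Integer::sum);
      }
    }
    return counts;
  }

  // Scales the limit with the number of known records for the queue,
  // clamped to [minCount, maxCount] so small sites stay polite and
  // large sites do not dominate a segment.
  static int maxCountFor(int knownRecords, int minCount, int maxCount, double fraction) {
    long scaled = Math.round(fraction * knownRecords);
    return (int) Math.max(minCount, Math.min(maxCount, scaled));
  }
}
```

With a fraction of 0.01 and bounds of 5 and 100, a host with 3000 known URLs would get a limit of 30, while a host with 50 known URLs would be held at the lower bound of 5.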
