nutch-dev mailing list archives

From Markus Jelsma <markus.jel...@openindex.io>
Subject Adaptive generate.max.count
Date Thu, 03 Nov 2011 23:32:12 GMT
Hi,

The generate.max.count setting defines the maximum number of records per type
of queue. We're looking for an improvement to make this setting dynamic. The
new variable would be the total number of records for that type of queue (ip,
host, domain).

How can we adapt the generator for this? The problem is that the generator has
no information on the number of records in the queue a given URL belongs to.

Any thoughts? Could we perhaps modify the updater to count the number of
records for a queue and write it to the CrawlDatum, rather than building a new
updater tool based on the information provided by the current domainstatistics
tool?
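
To make the idea concrete, here is a rough sketch of the two steps involved: counting records per queue, then deriving a per-queue limit from that total. It is plain Python rather than Nutch code, and everything in it is an assumption made for illustration: the function names, the choice of host as the queue key, and the percentage, floor and ceiling values. None of it is an existing Nutch API.

```python
from collections import Counter
from urllib.parse import urlsplit


def queue_key(url):
    """Hypothetical queue key: the lowercased host of the URL.

    A domain or IP based key would replace this function.
    """
    return (urlsplit(url).hostname or "").lower()


def count_per_queue(urls):
    """Count how many records fall into each queue."""
    return Counter(queue_key(u) for u in urls)


def adaptive_max_count(total, percent=10, floor=1, ceiling=1000):
    """Derive a per-queue limit from the queue's total record count.

    percent, floor and ceiling are illustrative values, not Nutch defaults.
    """
    return max(floor, min(ceiling, total * percent // 100))


urls = [
    "http://a.example/1",
    "http://a.example/2",
    "http://b.example/",
]
counts = count_per_queue(urls)
limits = {key: adaptive_max_count(n) for key, n in counts.items()}
```

In this sketch the count is computed in one pass and the limit in another, which mirrors the question above: the count would have to be produced somewhere (for example by the updater) before the generator could use it.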

Thanks
