nutch-dev mailing list archives

From Matthias Jaekle
Subject Re: incremental crawling
Date Thu, 01 Dec 2005 21:49:15 GMT
> 3. Generate a fetch list & fetch it.  When the url has been previously 
> fetched, and its content is unchanged, increase its fetch interval by an 
> amount, e.g., 50%.  If the content is changed, decrease the fetch 
> interval.  The percentage of increase and decrease might be influenced 
> by the url's score.
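
For reference, the interval update described in the quoted step could be
sketched roughly as below. This is only an illustration, not Nutch code:
the function name, the 50% step, the clamping bounds and the way the score
scales the step are all assumptions.

```python
def next_fetch_interval(interval_days, changed, score=1.0,
                        step=0.5, min_days=1.0, max_days=90.0):
    """Return the next fetch interval in days (hypothetical helper).

    Unchanged content lengthens the interval, changed content shortens it.
    The score, clamped to [0, 1], scales the step; that scaling is an
    assumption, since the quoted proposal only says the score "might"
    influence the percentage.
    """
    adjustment = step * min(max(score, 0.0), 1.0)
    if changed:
        interval_days *= 1.0 - adjustment
    else:
        interval_days *= 1.0 + adjustment
    # Keep the interval within assumed bounds so it neither collapses
    # to zero nor grows without limit.
    return min(max(interval_days, min_days), max_days)
```

With these assumed defaults, an unchanged page on a 30-day interval moves
to 45 days, and a changed one drops to 15 days.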

If we tracked the amount of change this way, the ranking algorithm could 
also prefer pages that change more often.
Frequently changing pages are likely to be more up to date and could have 
a higher value than pages that never change.
Likewise, pages that have been unchanged for a long time may be going out 
of date and could lose a little of their general score.
So maybe the fetch interval value could be used as a multiplier for 
boosting pages in the final result set.
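
A rough sketch of what I mean, purely as an illustration: the names, the
30-day reference interval and the boost limits below are assumptions and
not anything that exists in Nutch today.

```python
def freshness_boost(interval_days, default_days=30.0,
                    min_boost=0.5, max_boost=2.0):
    """Map a page's fetch interval to a score multiplier (hypothetical).

    A page at the assumed default interval gets a boost of 1.0. Shorter
    intervals (pages that change often) get more, longer intervals get
    less. The result is clamped so the boost cannot dominate the score.
    """
    if interval_days <= 0:
        raise ValueError("interval_days must be positive")
    ratio = default_days / interval_days
    return min(max(ratio, min_boost), max_boost)


def boosted_score(base_score, interval_days):
    """Apply the freshness multiplier to a page's base ranking score."""
    return base_score * freshness_boost(interval_days)
```

The inverse ratio is just the simplest choice; a logarithmic mapping might 
be gentler if intervals vary over a wide range.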

