nutch-dev mailing list archives

From Andrzej Bialecki
Subject CrawlDbReducer - selecting data for DB update
Date Fri, 07 Apr 2006 10:24:11 GMT

The more I look at CrawlDbReducer the less I like the method it uses to 
select the most recent records.

This selection is primarily made in the while() loop in 
CrawlDbReducer:45. My main objection is that selecting the "highest" 
value (meaning "most recent") relies on the fact that values of status 
codes in CrawlDatum are ordered according to their meaning, and they are 
treated as a sort of state machine. However, adding new states is very 
difficult, if they should have values lower than STATUS_FETCH_GONE, as 
it leads to breaking backwards-compatibility with older segment data. 
Adding status codes with higher values may also break things here, 
because a CrawlDatum with the highest code would not necessarily be the 
most recent.
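
To make the objection concrete, here is a minimal sketch of "highest status wins" selection. It is not the actual CrawlDbReducer code: the Datum class, the selectByHighestStatus method and the value 5 for STATUS_FETCH_SUCCESS are assumptions made up for illustration; only the value 8 for STATUS_FETCH_REDIRECT comes from the proposal in this message.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for CrawlDatum: just a status code and a fetch time.
final class Datum {
    final int status;
    final long fetchTime;

    Datum(int status, long fetchTime) {
        this.status = status;
        this.fetchTime = fetchTime;
    }
}

final class HighestStatusSelection {
    // Illustrative value for an existing code; not copied from CrawlDatum.
    static final int STATUS_FETCH_SUCCESS = 5;
    // The value proposed in this message for a new redirect status.
    static final int STATUS_FETCH_REDIRECT = 8;

    // Keep the datum with the numerically highest status code,
    // ignoring the fetch time entirely.
    static Datum selectByHighestStatus(List<Datum> values) {
        Datum highest = null;
        for (Datum d : values) {
            if (highest == null || d.status > highest.status) {
                highest = d;
            }
        }
        return highest;
    }

    public static void main(String[] args) {
        Datum older = new Datum(STATUS_FETCH_REDIRECT, 1000L);
        Datum newer = new Datum(STATUS_FETCH_SUCCESS, 2000L);
        Datum picked = selectByHighestStatus(Arrays.asList(newer, older));
        // The older record is picked because 8 > 5, although it is not the most recent.
        System.out.println("picked status=" + picked.status
                + " fetchTime=" + picked.fetchTime);
    }
}
```

In this sketch the older datum is selected purely because its code is numerically larger, which is the kind of breakage a new high-valued status could cause.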

I first encountered this problem when adding the signature framework. 
Fortunately, there was one unused value (0) at that time, so I could add 
CrawlDatum.STATUS_SIGNATURE without breaking the assumptions in 

However, now things become more difficult:

* we need another status code for new pages discovered as a 
result of redirection (see the thread on "Meta-refresh"). If we add this 
status as e.g. STATUS_FETCH_REDIRECT = 8, then the logic in 
CrawlDbReducer will break.

* we need something to mark pages as "being on a fetchlist, to be 
updated soon" (this is to support multiple parallel 
generate/fetch/update cycles). A new status code would do fine for this 
purpose (although we need an expiry timer for that too). Arguably, we 
could use the same trick that we used in 0.7 (moving next fetch time 1 
week into the future), but I'm not sure yet how it would play with the 
adaptive fetch patches, which manipulate this value too...
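
The fetch-time trick from the second point can be sketched roughly as follows. This is only an illustration under assumptions: the FetchlistMark class and its method names are hypothetical, and the real 0.7 code may differ in detail. It shows why the shifted fetch time acts as both the "on a fetchlist" mark and its expiry timer.

```java
// Hypothetical record holding only the next fetch time (milliseconds).
final class FetchlistMark {
    static final long ONE_WEEK_MS = 7L * 24 * 60 * 60 * 1000;

    private long nextFetchTime;

    FetchlistMark(long nextFetchTime) {
        this.nextFetchTime = nextFetchTime;
    }

    // A page is eligible for a fetchlist once its next fetch time has passed.
    boolean isDue(long now) {
        return nextFetchTime <= now;
    }

    // Mark the page as "on a fetchlist" by pushing its next fetch time
    // one week ahead. If no update arrives, the page becomes due again
    // after a week, so the mark expires by itself.
    void markGenerated(long now) {
        nextFetchTime = now + ONE_WEEK_MS;
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        FetchlistMark page = new FetchlistMark(now - 1);
        System.out.println("due before generate: " + page.isDue(now));
        page.markGenerated(now);
        System.out.println("due right after generate: " + page.isDue(now));
        System.out.println("due one week later: " + page.isDue(now + ONE_WEEK_MS));
    }
}
```

Anything else that rewrites the next fetch time, such as an adaptive fetch schedule, would silently overwrite the mark, which is the interaction I am unsure about.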

I could use a hack in the meantime: status values are for now all below 
128, and the codes in use are small, so we could use the upper nibble for 
these additional flags and mask them out with 0x0f before comparing 
status codes.
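
A minimal sketch of that nibble hack, assuming the base status codes fit in the lower four bits (0 to 15). The StatusFlags class and the two flag names are hypothetical and are not part of CrawlDatum.

```java
final class StatusFlags {
    // Lower nibble: the existing status code.
    static final int STATUS_MASK = 0x0f;
    // Upper nibble: hypothetical extra flags, chosen so the value stays below 128.
    static final int FLAG_ON_FETCHLIST = 0x10;
    static final int FLAG_REDIRECT = 0x20;

    // Combine a base status code with a flag.
    static int withFlag(int status, int flag) {
        return status | flag;
    }

    // Strip the flags so existing comparisons see only the base status code.
    static int baseStatus(int status) {
        return status & STATUS_MASK;
    }

    // Test whether a given flag is set.
    static boolean hasFlag(int status, int flag) {
        return (status & flag) != 0;
    }

    public static void main(String[] args) {
        int status = withFlag(5, FLAG_ON_FETCHLIST);
        System.out.println("stored value: 0x" + Integer.toHexString(status));
        System.out.println("base status: " + baseStatus(status));
        System.out.println("on fetchlist: " + hasFlag(status, FLAG_ON_FETCHLIST));
    }
}
```

With this layout the ordering logic would compare baseStatus() values only, so the flags would not disturb the existing ordering of the codes.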

Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
Contact: info at sigram dot com
