nutch-dev mailing list archives

From "Doug Cook (JIRA)" <>
Subject [jira] Commented: (NUTCH-416) CrawlDatum status and CrawlDbReducer refactoring
Date Wed, 20 Dec 2006 22:40:22 GMT
Doug Cook commented on NUTCH-416:

You may also want to make the status codes ORed values, so that, for example, the various
kinds of failure all have a FAILURE code ORed in. That makes it clean and easy in the code to
check for "any failure case" while still allowing different failure codes. So at the lowest
levels, the values might be things like FAILED, FETCHED, and UNFETCHED, while REDIRECT might
be (FETCHED | something), specific redirect codes would be (REDIRECT | something), specific
failure codes would be (FAILED | something), etc. This way we can keep all of the specific
failure codes, all the specific redirect codes, etc., while making the code cleaner and more
reliable. We won't have to worry about keeping range checks or switch statements in sync if
we add new codes; a statement like
   if ((code & FAILED) != 0) {
will always tell us whether a URL fetch failed, regardless of how many codes we add. As the
code currently stands, adding a status code is likely to break things unless one carefully
goes through every place where status codes are examined and makes sure the new code is
properly accounted for.
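
To make the idea concrete, here is a minimal sketch of what such ORed codes could look like.
Every constant name and bit value below is hypothetical, chosen only for illustration; none
of them are the actual CrawlDatum codes. Note that a sub-category such as REDIRECT has to be
tested with "== REDIRECT" rather than "!= 0", since it shares the FETCHED bit with ordinary
fetched pages.

```java
/** Illustrative only: hypothetical ORed status codes, not Nutch's real CrawlDatum values. */
final class StatusCodes {
    // Base categories, one bit each (assumed values).
    static final int UNFETCHED = 0x100;
    static final int FETCHED   = 0x200;
    static final int FAILED    = 0x400;

    // REDIRECT is a sub-category of FETCHED: the FETCHED bit plus one extra bit.
    static final int REDIRECT  = FETCHED | 0x800;

    // Specific codes: the low byte distinguishes codes within a category.
    static final int REDIRECT_TEMP    = REDIRECT | 0x01;
    static final int REDIRECT_PERM    = REDIRECT | 0x02;
    static final int FAILED_TIMEOUT   = FAILED | 0x01;
    static final int FAILED_NOT_FOUND = FAILED | 0x02;

    /** True for FAILED itself and for every specific failure code. */
    static boolean isFailure(int code) {
        // Parentheses are required: in Java, != binds tighter than &.
        return (code & FAILED) != 0;
    }

    /** True for FETCHED and everything built on it, including redirects. */
    static boolean isFetched(int code) {
        return (code & FETCHED) != 0;
    }

    /** True only when all REDIRECT bits are set, so plain FETCHED does not match. */
    static boolean isRedirect(int code) {
        return (code & REDIRECT) == REDIRECT;
    }

    public static void main(String[] args) {
        System.out.println(isFailure(FAILED_TIMEOUT));  // true
        System.out.println(isFailure(REDIRECT_PERM));   // false
        System.out.println(isFetched(REDIRECT_TEMP));   // true
        System.out.println(isRedirect(FETCHED));        // false
    }
}
```

Adding, say, a new FAILED | 0x03 code later would not require touching isFailure() or any of
its callers.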

While you're changing the CrawlDatum, it might also make sense to store a second URL, e.g.,
that of the redirect target. I have a hunch this will be very useful.

Just some thoughts. Thanks for making this happen.


> CrawlDatum status and CrawlDbReducer refactoring
> ------------------------------------------------
>                 Key: NUTCH-416
>                 URL:
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 0.9.0
>            Reporter: Andrzej Bialecki 
>         Assigned To: Andrzej Bialecki 
>             Fix For: 0.9.0
> CrawlDatum needs more status codes, e.g. to reflect redirected pages. However, current
> values of status codes are linear, which prevents us from adding new codes in proper places.
> This is also related to the logic in CrawlDbReducer, which makes decisions based on arithmetic
> ordering of status code values.
> I propose to change the codes so that they are grouped into related values, with significant
> gaps between groups for adding new codes without causing significant reordering. I also propose
> to change the logic in CrawlDbReducer so that its operation is not so dependent on actual
> code values.
> A mapping should also be added between old and new codes to facilitate backward-compatibility
> of existing data. This mapping should be applied on the fly, without requiring explicit data


