nutch-dev mailing list archives

From "Andrzej Bialecki (JIRA)" <>
Subject [jira] Closed: (NUTCH-416) CrawlDatum status and CrawlDbReducer refactoring
Date Thu, 28 Dec 2006 00:14:22 GMT

Andrzej Bialecki  closed NUTCH-416.

    Resolution: Fixed

Fixed in trunk, rev. 490607. As a side effect, it is now possible to correctly update CrawlDB
from multiple segments, even if they contain duplicate pages: the code in CrawlDbReducer
will apply only the latest version of CrawlDatum.
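
To illustrate the "latest version wins" behaviour, here is a minimal sketch of how a reducer might pick one record when several segments contain the same URL. It is not the code committed in rev. 490607. FetchRecord and LatestRecordPicker are hypothetical stand-ins for the real classes, and treating "latest" as "greatest fetch time" is an assumption.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a per-URL record; NOT the real Nutch CrawlDatum class.
final class FetchRecord {
    final String url;
    final long fetchTime;  // assumed: milliseconds since the epoch
    final String status;

    FetchRecord(String url, long fetchTime, String status) {
        this.url = url;
        this.fetchTime = fetchTime;
        this.status = status;
    }
}

// Hypothetical helper showing the selection a reducer could perform per URL.
final class LatestRecordPicker {
    // Returns the record with the greatest fetchTime, or null for an empty list.
    // On a tie, the record seen first is kept.
    static FetchRecord latest(List<FetchRecord> recordsForOneUrl) {
        FetchRecord best = null;
        for (FetchRecord r : recordsForOneUrl) {
            if (best == null || r.fetchTime > best.fetchTime) {
                best = r;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // The same page fetched in two different segments.
        FetchRecord older = new FetchRecord("http://example.com/", 1000L, "fetched");
        FetchRecord newer = new FetchRecord("http://example.com/", 2000L, "gone");
        FetchRecord chosen = latest(Arrays.asList(newer, older));
        System.out.println(chosen.status);
    }
}
```

Because the choice depends only on the timestamp, the order in which segments are supplied does not change the result.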

> CrawlDatum status and CrawlDbReducer refactoring
> ------------------------------------------------
>                 Key: NUTCH-416
>                 URL:
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 0.9.0
>            Reporter: Andrzej Bialecki 
>         Assigned To: Andrzej Bialecki 
>             Fix For: 0.9.0
> CrawlDatum needs more status codes, e.g. to reflect redirected pages. However, current
> values of status codes are linear, which prevents us from adding new codes in proper places.
> This is also related to the logic in CrawlDbReducer, which makes decisions based on arithmetic
> ordering of status code values.
> I propose to change the codes so that they are grouped into related values, with significant
> gaps between groups for adding new codes without causing significant reordering. I also propose
> to change the logic in CrawlDbReducer so that its operation is not so dependent on actual
> code values.
> A mapping should also be added between old and new codes to facilitate backward-compatibility
> of existing data. This mapping should be applied on the fly, without requiring explicit data
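
The proposal above can be sketched roughly as follows. This is only an illustration of grouped codes with gaps plus an old-to-new lookup applied on read. The class name GroupedStatus, every numeric value, and the "old" linear codes are assumptions made for the example and may not match what was committed for NUTCH-416.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration; all names and values here are assumed.
final class GroupedStatus {
    // Group 1: statuses stored in the CrawlDb, range 0x01..0x1f.
    static final byte DB_UNFETCHED  = 0x01;
    static final byte DB_FETCHED    = 0x02;
    static final byte DB_GONE       = 0x03;
    static final byte DB_REDIRECTED = 0x04;  // a new code fits in the gap, no renumbering
    static final byte DB_MAX        = 0x1f;

    // Group 2: fetch results, range 0x21..0x3f.
    static final byte FETCH_SUCCESS = 0x21;
    static final byte FETCH_RETRY   = 0x22;
    static final byte FETCH_GONE    = 0x23;
    static final byte FETCH_MAX     = 0x3f;

    // Callers test group membership instead of comparing raw code values.
    static boolean isDbStatus(byte s)    { return s > 0 && s <= DB_MAX; }
    static boolean isFetchStatus(byte s) { return s > DB_MAX && s <= FETCH_MAX; }

    // Assumed old linear codes: 1..3 were db statuses, 4..6 were fetch statuses.
    private static final Map<Byte, Byte> OLD_TO_NEW = new HashMap<>();
    static {
        OLD_TO_NEW.put((byte) 1, DB_UNFETCHED);
        OLD_TO_NEW.put((byte) 2, DB_FETCHED);
        OLD_TO_NEW.put((byte) 3, DB_GONE);
        OLD_TO_NEW.put((byte) 4, FETCH_SUCCESS);
        OLD_TO_NEW.put((byte) 5, FETCH_RETRY);
        OLD_TO_NEW.put((byte) 6, FETCH_GONE);
    }

    // Translates an old code while reading a record written in the old format.
    static byte fromOld(byte oldCode) {
        Byte mapped = OLD_TO_NEW.get(oldCode);
        if (mapped == null) {
            throw new IllegalArgumentException("unknown old status code: " + oldCode);
        }
        return mapped;
    }
}
```

A reader for existing data could call fromOld() whenever it encounters a record written in the older format, so the stored data would not have to be rewritten first.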


