nutch-dev mailing list archives

From "Andrzej Bialecki (JIRA)" <>
Subject [jira] Commented: (NUTCH-416) CrawlDatum status and CrawlDbReducer refactoring
Date Wed, 20 Dec 2006 23:18:22 GMT
Andrzej Bialecki  commented on NUTCH-416:

The status codes fall into two main groups, though not along the lines of success/failure:
DB status codes and Fetch status codes. Additionally, the number of bits available for a bitmask
is very small, because the status needs to fit in a byte.

My patch in progress currently contains the following:

  public static final byte STATUS_DB_UNFETCHED      = 0x01;
  public static final byte STATUS_DB_FETCHED        = 0x02;
  public static final byte STATUS_DB_GONE           = 0x03;
  public static final byte STATUS_DB_REDIR_TEMP     = 0x04;
  public static final byte STATUS_DB_REDIR_PERM     = 0x05;
  /** Maximum value of DB-related status. */
  public static final byte STATUS_DB_MAX            = 0x1f;
  public static final byte STATUS_FETCH_SUCCESS     = 0x21;
  public static final byte STATUS_FETCH_RETRY       = 0x22;
  public static final byte STATUS_FETCH_REDIR_TEMP  = 0x23;
  public static final byte STATUS_FETCH_REDIR_PERM  = 0x24;
  public static final byte STATUS_FETCH_GONE        = 0x25;
  /** Maximum value of fetch-related status. */
  public static final byte STATUS_FETCH_MAX         = 0x3f;
  public static final byte STATUS_SIGNATURE         = 0x41;
  public static final byte STATUS_INJECTED          = 0x42;
  public static final byte STATUS_LINKED            = 0x43;
  public static boolean hasDbStatus(CrawlDatum datum) {
    if (datum.status <= STATUS_DB_MAX) return true;
    return false;
  }

  public static boolean hasFetchStatus(CrawlDatum datum) {
    if (datum.status > STATUS_DB_MAX && datum.status <= STATUS_FETCH_MAX) return true;
    return false;
  }

... so, I went with ranges of values. The most unwieldy switch() statements in the current
code were the ones that distinguish DB status from Fetch status; the two static methods above
take over that check and simplify the code.
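
To illustrate how the range checks classify a status value, here is a minimal standalone sketch. It assumes the constants quoted above and takes a plain byte instead of a CrawlDatum so that it runs on its own; the class name StatusRanges is hypothetical and not part of the patch.

```java
// Standalone sketch of the range-based checks; not the actual CrawlDatum class.
class StatusRanges {
  // Constants copied from the patch excerpt.
  static final byte STATUS_DB_UNFETCHED  = 0x01;
  static final byte STATUS_DB_MAX        = 0x1f;
  static final byte STATUS_FETCH_SUCCESS = 0x21;
  static final byte STATUS_FETCH_MAX     = 0x3f;
  static final byte STATUS_INJECTED      = 0x42;

  // True for any status at or below the top of the DB range.
  static boolean hasDbStatus(byte status) {
    return status <= STATUS_DB_MAX;
  }

  // True for any status above the DB range and at or below the top of the fetch range.
  static boolean hasFetchStatus(byte status) {
    return status > STATUS_DB_MAX && status <= STATUS_FETCH_MAX;
  }

  public static void main(String[] args) {
    byte[] samples = { STATUS_DB_UNFETCHED, STATUS_FETCH_SUCCESS, STATUS_INJECTED };
    for (byte s : samples) {
      System.out.println("0x" + Integer.toHexString(s)
          + " db=" + hasDbStatus(s) + " fetch=" + hasFetchStatus(s));
    }
  }
}
```

A status such as STATUS_INJECTED falls into neither range, so both checks return false for it. New codes can be added inside a range without touching either method.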

Regarding the redirect URL - because of space constraints I'd rather use Metadata for this.
We already handle metadata efficiently, so that performance doesn't suffer if we don't have
any metadata to keep. It would make sense, though, to have a predefined key for this URL.
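
A rough sketch of what a predefined metadata key for the redirect URL might look like. Everything here is an assumption for illustration: the key name, the Datum stand-in class, and the use of a plain java.util.Map in place of whatever metadata container CrawlDatum actually uses.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative stand-in only; not the real CrawlDatum or its metadata API.
class RedirectMetadataSketch {
  /** Hypothetical predefined key; the name is an assumption, not taken from the patch. */
  static final String REDIRECT_URL_KEY = "_redirect_url_";

  static class Datum {
    // Left null until something is stored, so records without metadata carry no map.
    private Map<String, String> metaData;

    void setRedirectUrl(String url) {
      if (metaData == null) metaData = new HashMap<>();
      metaData.put(REDIRECT_URL_KEY, url);
    }

    // Returns null when no redirect target has been recorded.
    String getRedirectUrl() {
      return metaData == null ? null : metaData.get(REDIRECT_URL_KEY);
    }

    boolean hasMetaData() {
      return metaData != null && !metaData.isEmpty();
    }
  }

  public static void main(String[] args) {
    Datum plain = new Datum();
    Datum redirected = new Datum();
    redirected.setRedirectUrl("http://example.com/new-location");
    System.out.println(plain.hasMetaData() + " " + plain.getRedirectUrl());
    System.out.println(redirected.hasMetaData() + " " + redirected.getRedirectUrl());
  }
}
```

The map is created only on first write, which mirrors the point that records with no metadata should not pay for it.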

> CrawlDatum status and CrawlDbReducer refactoring
> ------------------------------------------------
>                 Key: NUTCH-416
>                 URL:
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 0.9.0
>            Reporter: Andrzej Bialecki 
>         Assigned To: Andrzej Bialecki 
>             Fix For: 0.9.0
> CrawlDatum needs more status codes, e.g. to reflect redirected pages. However, current
> values of status codes are linear, which prevents us from adding new codes in proper places.
> This is also related to the logic in CrawlDbReducer, which makes decisions based on arithmetic
> ordering of status code values.
> I propose to change the codes so that they are grouped into related values, with significant
> gaps between groups for adding new codes without causing significant reordering. I also propose
> to change the logic in CrawlDbReducer so that its operation is not so dependent on actual
> code values.
> A mapping should also be added between old and new codes to facilitate backward-compatibility
> of existing data. This mapping should be applied on the fly, without requiring explicit data
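
The on-the-fly mapping proposed in the issue description could be sketched roughly as follows. This is an illustration under stated assumptions, not the patch itself: the legacy code values are placeholders rather than the real old constants, and the idea of a per-record version number that tells old data from new is also an assumption.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of translating legacy status codes while reading a record.
class StatusMigrationSketch {
  // New-style codes copied from the patch excerpt in the comment.
  static final byte STATUS_DB_UNFETCHED  = 0x01;
  static final byte STATUS_DB_FETCHED    = 0x02;
  static final byte STATUS_FETCH_SUCCESS = 0x21;

  // Assumed version marker: records written before this version use legacy codes.
  static final byte CURRENT_VERSION = 2;

  // Placeholder legacy values, chosen only for illustration.
  static final Map<Byte, Byte> OLD_TO_NEW = new HashMap<>();
  static {
    OLD_TO_NEW.put((byte) 1, STATUS_DB_UNFETCHED);
    OLD_TO_NEW.put((byte) 2, STATUS_DB_FETCHED);
    OLD_TO_NEW.put((byte) 5, STATUS_FETCH_SUCCESS);
  }

  /** Returns the new-style code for a status read from a record of the given version. */
  static byte upgrade(byte recordVersion, byte status) {
    if (recordVersion >= CURRENT_VERSION) return status; // already new-style
    Byte mapped = OLD_TO_NEW.get(status);
    if (mapped == null) {
      throw new IllegalArgumentException("Unknown legacy status: " + status);
    }
    return mapped;
  }

  public static void main(String[] args) {
    // A legacy record carrying placeholder code 5 is read back as STATUS_FETCH_SUCCESS.
    System.out.println("0x" + Integer.toHexString(upgrade((byte) 1, (byte) 5)));
  }
}
```

Because the translation happens at read time, existing data does not have to be rewritten before the new code can use it.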

This message is automatically generated by JIRA.