nutch-dev mailing list archives

From "Andrew McCall (JIRA)" <>
Subject [jira] Updated: (NUTCH-650) Hbase Integration
Date Tue, 03 Mar 2009 14:54:56 GMT


Andrew McCall updated NUTCH-650:

    Attachment: meta.patch

I've changed how the TMP_X_MARK metadata is handled so that multiple fetch cycles can
run at the same time.

* GeneratorHbase adds the TMP_FETCH_MARK as before
* FetcherHbase
** crawls any rows with TMP_FETCH_MARK set and, as before, sets TMP_PARSE_MARK so that the
parser knows to parse the row
** removes the TMP_FETCH_MARK column, so that a later fetch started before UpdateTable is
called won't re-fetch the row.
* ParseTable 
** parses any rows with TMP_PARSE_MARK set and sets TMP_UPDATE_MARK as before
** removes the column TMP_PARSE_MARK so that a later parse won't re-parse the row. 
* UpdateTable now, by default, updates only rows with TMP_UPDATE_MARK set, leaving rows that
have not been fetched or parsed yet in their current state.
* calling UpdateTable with the new -all option forces it to update all rows in the table;
it then acts as it did before the patch, removing any TMP_X_MARK from the rows.
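
The hand-off between marks described above can be sketched as a small in-memory simulation. This is only an illustration, not code from meta.patch: the class MarkPipeline, its methods, and the map standing in for the HBase table are all hypothetical, and it assumes that the default UpdateTable run clears TMP_UPDATE_MARK on the rows it updates.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory model of the mark hand-off; not the patch's code.
public class MarkPipeline {
    static final String FETCH = "TMP_FETCH_MARK";
    static final String PARSE = "TMP_PARSE_MARK";
    static final String UPDATE = "TMP_UPDATE_MARK";

    // Stand-in for the table: row key -> mark columns currently present.
    final Map<String, Set<String>> rows = new LinkedHashMap<>();

    // Generator step: mark rows as ready to fetch.
    void generate(String... urls) {
        for (String url : urls) {
            rows.computeIfAbsent(url, k -> new HashSet<>()).add(FETCH);
        }
    }

    // Removes `from` on every row that carries it and sets `to` (if any).
    // Returns the keys of the rows that were handled.
    private List<String> advance(String from, String to) {
        List<String> handled = new ArrayList<>();
        for (Map.Entry<String, Set<String>> e : rows.entrySet()) {
            if (e.getValue().remove(from)) {
                if (to != null) {
                    e.getValue().add(to);
                }
                handled.add(e.getKey());
            }
        }
        return handled;
    }

    List<String> fetch() { return advance(FETCH, PARSE); }

    List<String> parse() { return advance(PARSE, UPDATE); }

    // Default: only rows with TMP_UPDATE_MARK (assumed to be cleared here).
    // With all=true: every row is updated and every mark is cleared.
    List<String> update(boolean all) {
        if (!all) {
            return advance(UPDATE, null);
        }
        List<String> handled = new ArrayList<>(rows.keySet());
        for (Set<String> marks : rows.values()) {
            marks.clear();
        }
        return handled;
    }

    public static void main(String[] args) {
        MarkPipeline table = new MarkPipeline();
        table.generate("http://a.example/");
        table.fetch();
        table.generate("http://b.example/"); // second cycle starts before the update
        table.parse();
        System.out.println(table.update(false)); // [http://a.example/]
        System.out.println(table.fetch());       // [http://b.example/]
    }
}
```

In this sketch the second cycle's row keeps its TMP_FETCH_MARK through the first cycle's update, which is the overlap the change is meant to allow; update(true) would instead touch both rows and clear every mark.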

> Hbase Integration
> -----------------
>                 Key: NUTCH-650
>                 URL:
>             Project: Nutch
>          Issue Type: New Feature
>    Affects Versions: 1.0.0
>            Reporter: Doğacan Güney
>            Assignee: Doğacan Güney
>             Fix For: 1.1
>         Attachments: hbase-integration_v1.patch, hbase_v2.patch, malformedurl.patch, meta.patch, nofollow-hbase.patch, nutch-habase.patch, slash.patch
> This issue will track nutch/hbase integration

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
