nutch-dev mailing list archives

From Doğacan Güney (JIRA) <j...@apache.org>
Subject [jira] Commented: (NUTCH-650) Hbase Integration
Date Mon, 13 Jul 2009 15:39:15 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12730380#action_12730380 ]

Doğacan Güney commented on NUTCH-650:
-------------------------------------

Many changes.

First, for simplicity, I changed the master branch to be the main development branch. So to
take a look at nutchbase, simply clone:

git://github.com/dogacan/nutchbase.git

(sorry Andrew for the random change :)

* Upgraded to hbase trunk and hadoop 0.20.

* FetcherHbase now fetches URLs in reduce(). I added a randomization step so that reduce()
no longer gets URLs from the same host one after another but in a random order. Politeness
rules are still followed, and one host will always be handled by a single reducer no matter
how many URLs it has (at least, that's what I tried to do; testing is welcome :).

* If your fetch is cut short, you lose almost none of the fetched URLs, as we immediately write
the fetched content to the table*. For example, if you are doing a HUGE one-day fetch and
your fetch dies at the 20th hour, then 20 hours' worth of fetched URLs will already be in hbase.
The next execution of FetcherHbase will simply pick up where it left off.

* The same goes for ParseTable. If a parse crashes midstream, the next execution will continue
from the crash point*.

* Added a "-restart" option for ParseTable and FetcherHbase. If "-restart" is present, then
these classes start at the beginning instead of continuing from wherever the last run finished.

* Added a "-reindex" option to IndexerHbase to reindex the entire table (normally, only the URLs
parsed successfully in that iteration are processed).

* Added a SolrIndexerHbase so you can use solr with hbase (which is awesome :). It also has a
"-reindex" option.

*= Strictly speaking, we do not write content immediately, because the hbase client code uses
a write buffer to buffer updates. Still, you will lose very few URLs rather than all of them
(and the write buffer size can be made smaller for more safety).
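
The trade-off in this footnote can be illustrated with a small, library-free simulation.
This is only a sketch under stated assumptions, not nutchbase or hbase client code: the
names BufferedTable, fetch, crash_after and restart are hypothetical, and the "table" is a
plain in-memory dict. It shows that a crash loses only the writes still sitting in the
buffer, and that a later run skips whatever is already stored.

```python
class BufferedTable:
    """Toy stand-in for a table whose client buffers writes before flushing."""

    def __init__(self, buffer_size):
        self.buffer_size = buffer_size
        self.rows = {}      # flushed rows: these survive a crash
        self._buffer = []   # pending writes: these are lost on a crash

    def put(self, key, value):
        self._buffer.append((key, value))
        if len(self._buffer) >= self.buffer_size:
            self.flush()

    def flush(self):
        self.rows.update(self._buffer)
        self._buffer.clear()


def fetch(urls, table, crash_after=None, restart=False):
    """Fetch URLs into the table; return True if the run completed.

    By default, URLs already in the table are skipped (resume behaviour).
    restart=True mimics a "-restart"-style run that begins from the start.
    crash_after=N simulates the job dying after N URLs were fetched.
    """
    todo = list(urls) if restart else [u for u in urls if u not in table.rows]
    for done, url in enumerate(todo):
        if crash_after is not None and done == crash_after:
            table._buffer.clear()  # simulate the crash: unflushed writes vanish
            return False
        table.put(url, "content of " + url)
    table.flush()
    return True
```

In this model a crash loses at most buffer_size - 1 fetched URLs, which is why a smaller
buffer gives more safety at the cost of more frequent writes.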

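The reducer assignment described in the FetcherHbase item above can be sketched in the same
spirit. Again, this is an illustration under assumptions rather than the actual
implementation: host_partition and plan_fetch are hypothetical names, CRC32 is just one
stable hash that could be used, and politeness delays are not modelled.

```python
import random
import zlib
from collections import defaultdict
from urllib.parse import urlsplit


def host_partition(url, num_reducers):
    """Map a URL to a reducer index using only its host name."""
    host = urlsplit(url).hostname or ""
    # crc32 is stable across runs, unlike Python's built-in hash() for strings.
    return zlib.crc32(host.encode("utf-8")) % num_reducers


def plan_fetch(urls, num_reducers, seed=None):
    """Group URLs by reducer, then shuffle the order within each reducer."""
    rng = random.Random(seed)
    plan = defaultdict(list)
    for url in urls:
        plan[host_partition(url, num_reducers)].append(url)
    for bucket in plan.values():
        # Shuffling makes back-to-back requests to one host less likely;
        # it does not by itself enforce any politeness delay.
        rng.shuffle(bucket)
    return dict(plan)
```

Because the partition depends only on the host, every URL of a given host lands in the same
reducer regardless of how many URLs that host has.
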
There is still more to do (such as updating scoring for hbase), but most of it is, IMHO,
ready. Can I get some reviews on what people think of the general direction, the API, etc.?
This (and katta integration) are my priorities for the next nutch.

> Hbase Integration
> -----------------
>
>                 Key: NUTCH-650
>                 URL: https://issues.apache.org/jira/browse/NUTCH-650
>             Project: Nutch
>          Issue Type: New Feature
>    Affects Versions: 1.0.0
>            Reporter: Doğacan Güney
>            Assignee: Doğacan Güney
>             Fix For: 1.1
>
>         Attachments: hbase-integration_v1.patch, hbase_v2.patch, malformedurl.patch,
>                      meta.patch, meta2.patch, nofollow-hbase.patch, nutch-habase.patch, searching.diff, slash.patch
>
>
> This issue will track nutch/hbase integration

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

