nutch-dev mailing list archives

From Julien Nioche <lists.digitalpeb...@gmail.com>
Subject Re: Nutch 2.0 Help
Date Wed, 08 Sep 2010 10:53:41 GMT
Hi guys,

I've summarized the steps for running Nutch 2.0 with GORA + HBase at
http://wiki.apache.org/nutch/GORA_HBase

Feel free to amend and improve as you see fit.

Please bear in mind that Nutch 2.0 is at a very early stage and is far from
bug-free; see [1] in particular.

HTH

Julien

[1] https://issues.apache.org/jira/browse/NUTCH-893

-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com


On 6 September 2010 13:35, Andrzej Bialecki <ab@getopt.org> wrote:

> On 2010-09-05 14:56, David Stuart wrote:
>
>> Hi All,
>>
>> I have followed the steps below and can create a table from within the HBase
>> shell. I also found what looks like the appropriate table-creation command,
>> bin/nutch org.apache.nutch.storage.WebTableCreator webtable, but it only
>> returns null.
>>
>> Any help would be great.
>>
>
> You don't have to create a table manually - this should happen
> automatically when you first run any Nutch tool. Just make sure that
> hbase-site.xml is on the Nutch classpath. Ideally, put it in your conf/
> directory and rebuild, so that it gets packed into the job jar.
>
> Here, for example, are my config files that work with HBase (I don't use any
> non-standard settings for HBase, so my hbase-site.xml has no properties, but
> it still needs to be included in the Nutch job jar):
>
> gora-hbase-mapping.xml:
> -------------------------------------------------------------------------
>
> <gora-orm>
>
> <table name="webtable">
>  <family name="p"/> <!-- This can also have params like compression, bloom filters -->
>  <family name="f"/>
>  <family name="s"/>
>  <family name="il"/>
>  <family name="ol"/>
>  <family name="h"/>
>  <family name="mtdt"/>
>  <family name="mk"/>
> </table>
>
> <class table="webtable" keyClass="java.lang.String"
> name="org.apache.nutch.storage.WebPage">
>  <!-- fetch fields                                       -->
>  <field name="baseUrl" family="f" qualifier="bas"/>
>  <field name="status" family="f" qualifier="st"/>
>  <field name="prevFetchTime" family="f" qualifier="pts"/>
>  <field name="fetchTime" family="f" qualifier="ts"/>
>  <field name="fetchInterval" family="f" qualifier="fi"/>
>  <field name="retriesSinceFetch" family="f" qualifier="rsf"/>
>  <field name="reprUrl" family="f" qualifier="rpr"/>
>  <field name="content" family="f" qualifier="cnt"/>
>  <field name="contentType" family="f" qualifier="typ"/>
>  <field name="protocolStatus" family="f" qualifier="prot"/>
>  <field name="modifiedTime" family="f" qualifier="mod"/>
>
>  <!-- parse fields                                       -->
>  <field name="title" family="p" qualifier="t"/>
>  <field name="text" family="p" qualifier="c"/>
>  <field name="parseStatus" family="p" qualifier="st"/>
>  <field name="signature" family="p" qualifier="sig"/>
>  <field name="prevSignature" family="p" qualifier="psig"/>
>
>  <!-- score fields                                       -->
>  <field name="score" family="s" qualifier="s"/>
>
>  <field name="headers" family="h"/>
>
>  <field name="inlinks" family="il"/>
>
>  <field name="outlinks" family="ol"/>
>
>  <field name="metadata" family="mtdt"/>
>
>  <field name="markers" family="mk"/>
>
> </class>
>
> </gora-orm>
> -------------------------------------------------------------------------
>
> nutch-site.xml:
> -------------------------------------------------------------------------
> ... blah blah, a lot of unrelated stuff...
>
> <property>
>  <name>storage.data.store.class</name>
>  <value>org.gora.hbase.store.HBaseStore</value>
>
>  <description>Default class for storing data</description>
> </property>
> -------------------------------------------------------------------------
>
> Of course, you also need to use the same Hadoop config files (hdfs-site and
> mapred-site) as the ones that HBase uses.
>
>
> --
> Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
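
For reference, the quoted message mentions an hbase-site.xml that has no properties but must still be present in the job jar. A minimal sketch of such a file, assuming the standard Hadoop-style configuration layout (this is not a copy of Andrzej's actual file), might look like this:

```xml
<?xml version="1.0"?>
<!-- Assumed minimal hbase-site.xml: no overrides, HBase defaults apply. -->
<!-- Place it in Nutch's conf/ directory and rebuild so it ends up in the job jar. -->
<configuration>
</configuration>
```

Any site-specific HBase settings would go inside the configuration element as property entries, in the same form as the nutch-site.xml property quoted above.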
