nutch-dev mailing list archives

From Sebastian Nagel <wastl.na...@googlemail.com>
Subject Re: VOTE Apache Nutch 2.0 RC1
Date Wed, 13 Jun 2012 22:30:03 GMT
Hi Lewis,

> Please see http://wiki.apache.org/nutch/Nutch2Tutorial which is an
> update of Julien's (I think) page on GORA_HBase. This will get you
> rocking with HBase. The changes between Cassandra, Accumulo and the
> other data stores are fairly trivial.

I managed to perform a crawl with 2.0 and HBase: it rocks, indeed.
Much simpler than 1.x (no segments!).

Below are a couple of problems I've run into (possible issues to be addressed in 2.1).

Cheers,
Sebastian



% ./bin/nutch readdb -stats
WebTable statistics start
WebTableReader: java.io.EOFException
        at java.io.DataInputStream.readFully(DataInputStream.java:197)
        at java.io.DataInputStream.readFully(DataInputStream.java:169)
        at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1508)
        at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1486)
        at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1475)
        at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1470)
        at org.apache.hadoop.mapred.SequenceFileOutputFormat.getReaders(SequenceFileOutputFormat.java:89)
        at org.apache.nutch.crawl.WebTableReader.run(WebTableReader.java:537)
        at org.apache.nutch.crawl.WebTableReader.processStatJob(WebTableReader.java:218)
        at org.apache.nutch.crawl.WebTableReader.run(WebTableReader.java:479)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.crawl.WebTableReader.main(WebTableReader.java:412)
--> readdb -dump works.



% ./bin/nutch fetch 1339621550-203073321 -threads 1 -parse
Exception in thread "main" java.lang.IllegalArgumentException: arg -parse not recognized



% ./bin/nutch parse -all -force -resume
ParserJob: starting
ParserJob: resuming:    false           <<< -resume and
ParserJob: forced reparse:      false   <<< -force obviously ignored ?
ParserJob: parsing all



% ./bin/nutch generate
--> generates a batch ID, but shouldn't it show help as in 1.x?
--> is there a -topN option?



The 2.0 Solr schema and mappings still contain the field "site",
which has been removed in 1.x (NUTCH-1232).
It should be removed in 2.0 as well, so that a single Solr installation
can be maintained for all Nutch versions.

