nutch-dev mailing list archives

From Lewis John Mcgibbney <lewis.mcgibb...@gmail.com>
Subject Re: VOTE Apache Nutch 2.0 RC1
Date Wed, 13 Jun 2012 23:44:10 GMT
Hi Sebastian,

On Wed, Jun 13, 2012 at 11:30 PM, Sebastian Nagel
<wastl.nagel@googlemail.com> wrote:
> I managed to perform a crawl with 2.0 and HBase: it rocks, indeed.
> Much simpler than 1.x (no segments!).

:0)

> % ./bin/nutch readdb -stats
> WebTable statistics start
> WebTableReader: java.io.EOFException
>        at java.io.DataInputStream.readFully(DataInputStream.java:197)
>        at java.io.DataInputStream.readFully(DataInputStream.java:169)
>        at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1508)
>        at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1486)
>        at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1475)
>        at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1470)
>        at org.apache.hadoop.mapred.SequenceFileOutputFormat.getReaders(SequenceFileOutputFormat.java:89)
>        at org.apache.nutch.crawl.WebTableReader.run(WebTableReader.java:537)
>        at org.apache.nutch.crawl.WebTableReader.processStatJob(WebTableReader.java:218)
>        at org.apache.nutch.crawl.WebTableReader.run(WebTableReader.java:479)
>        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>        at org.apache.nutch.crawl.WebTableReader.main(WebTableReader.java:412)
> --> readdb -dump works.

Confirmed and ticket opened as NUTCH-1391

> % ./bin/nutch fetch 1339621550-203073321 -threads 1 -parse
> Exception in thread "main" java.lang.IllegalArgumentException: arg -parse not recognized

The -parse argument was removed in Nutch 2.0, so passing it now throws
an IllegalArgumentException. This is expected behaviour. To enable
parsing during fetching, please set the corresponding property in
nutch-site.xml. The incorrect -parse argument is still in the Usage
message because I was not diligent enough when patching the fetcher
CLI aesthetics. I'll address this within the issue below as well.
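
For illustration, a minimal sketch of such an override in
nutch-site.xml. It assumes the relevant property is fetcher.parse, as
listed in nutch-default.xml; please check the nutch-default.xml shipped
with your build before relying on the name or its default.

```xml
<!-- nutch-site.xml: sketch only; assumes the fetcher.parse property
     from nutch-default.xml controls parsing during the fetch step -->
<configuration>
  <property>
    <name>fetcher.parse</name>
    <value>true</value>
  </property>
</configuration>
```

With this in place the fetch command would be run without the -parse
argument, e.g. ./bin/nutch fetch 1339621550-203073321 -threads 1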

>
>
> % ./bin/nutch parse -all -force -resume
> ParserJob: starting
> ParserJob: resuming:    false           <<< -resume and
> ParserJob: forced reparse:      false   <<< -force obviously ignored ?
> ParserJob: parsing all

Yes, confirmed, and ticket opened as NUTCH-1392


> % ./bin/nutch generate
> --> generates batchid, but should show help as in 1.x ?
> --> is there an option -topN ?

Yes, this will be tracked in NUTCH-1393. Users may not necessarily wish
to generate at all, and may merely want to find out the GeneratorJob
CLI options... I will open the ticket just now and fix it for 2.1.

> The 2.0 Solr schema and mappings still contain the field "site"
> which has been removed in 1.x (NUTCH-1232).
> Should be done also in 2.0: it's easier to maintain only one Solr installation
> for all Nutch versions.

Logged in NUTCH-1394

Thanks Seb for your contributions here... this is exactly what we are after.

Does anyone have issues with running another RC and addressing these
issues in 2.1?

-- 
Lewis
