nutch-dev mailing list archives

From Lewis John Mcgibbney
Subject Re: VOTE Apache Nutch 2.0 RC1
Date Wed, 13 Jun 2012 23:44:10 GMT
Hi Sebastian,

On Wed, Jun 13, 2012 at 11:30 PM, Sebastian Nagel wrote:
> I managed to perform a crawl with 2.0 and HBase: it rocks, indeed.
> Much simpler than 1.x (no segments!).


> % ./bin/nutch readdb -stats
> WebTable statistics start
> WebTableReader:
>        at
>        at
>        at$Reader.init(
>        at$Reader.<init>(
>        at$Reader.<init>(
>        at$Reader.<init>(
>        at
> org.apache.hadoop.mapred.SequenceFileOutputFormat.getReaders(
>        at
>        at org.apache.nutch.crawl.WebTableReader.processStatJob(
>        at
>        at
>        at org.apache.nutch.crawl.WebTableReader.main(
> --> readdb -dump works.

Confirmed, and ticket opened as NUTCH-1391.

> % ./bin/nutch fetch 1339621550-203073321 -threads 1 -parse
> Exception in thread "main" java.lang.IllegalArgumentException: arg -parse not recognized

The -parse argument was removed in Nutch 2.0, so passing it now throws an
IllegalArgumentException. This is expected behaviour. To enable parsing
during fetching, please set the relevant property in nutch-site.xml. The
incorrect -parse argument is still in the Usage message because I was not
diligent enough when patching the fetcher CLI aesthetics. I'll address
this within the issue below as well.
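
For reference, a minimal sketch of such an override in nutch-site.xml. This
assumes the property in question is fetcher.parse, which is how it is named
in nutch-default.xml as far as I recall; please check the nutch-default.xml
shipped with your release before relying on it.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Assumption: fetcher.parse is the property that controls parsing
       during fetching. Verify the name and its default in the
       nutch-default.xml of your release. -->
  <property>
    <name>fetcher.parse</name>
    <value>true</value>
  </property>
</configuration>
```

With this in place the fetcher is run without the -parse argument.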

> % ./bin/nutch parse -all -force -resume
> ParserJob: starting
> ParserJob: resuming:    false           <<< -resume and
> ParserJob: forced reparse:      false   <<< -force obviously ignored ?
> ParserJob: parsing all

Yes, confirmed, and ticket opened as NUTCH-1392.

> % ./bin/nutch generate
> --> generates batchid, but should show help as in 1.x ?
> --> is there an option -topN ?

Yes, this will be tracked as NUTCH-1393. Users may not necessarily wish to
generate at all, and may merely want to find out the GeneratorJob CLI
options... I will open the ticket just now and fix it for 2.1.

> The 2.0 Solr schema and mappings still contain the field "site"
> which has been removed in 1.x (NUTCH-1232).
> Should be done also in 2.0: it's easier to maintain only one Solr installation
> for all Nutch versions.

Logged as NUTCH-1394.

Thanks Seb for your contributions here... this is exactly what we are after.

Does anyone have objections to running another RC and addressing these
issues in 2.1?

