nutch-dev mailing list archives

From Ferdy Galema <ferdy.gal...@kalooga.com>
Subject Re: VOTE Apache Nutch 2.0 RC1
Date Thu, 14 Jun 2012 07:21:31 GMT
Maybe just 1392? I went ahead and made a patch that should fix this. Feel
free to commit or ignore prior to RC2.

On Thu, Jun 14, 2012 at 1:44 AM, Lewis John Mcgibbney <lewis.mcgibbney@gmail.com> wrote:

> Hi Sebastian,
>
> On Wed, Jun 13, 2012 at 11:30 PM, Sebastian Nagel
> <wastl.nagel@googlemail.com> wrote:
> > I managed to perform a crawl with 2.0 and HBase: it rocks, indeed.
> > Much simpler than 1.x (no segments!).
>
> :0)
>
> > % ./bin/nutch readdb -stats
> > WebTable statistics start
> > WebTableReader: java.io.EOFException
> >        at java.io.DataInputStream.readFully(DataInputStream.java:197)
> >        at java.io.DataInputStream.readFully(DataInputStream.java:169)
> >        at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1508)
> >        at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1486)
> >        at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1475)
> >        at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1470)
> >        at org.apache.hadoop.mapred.SequenceFileOutputFormat.getReaders(SequenceFileOutputFormat.java:89)
> >        at org.apache.nutch.crawl.WebTableReader.run(WebTableReader.java:537)
> >        at org.apache.nutch.crawl.WebTableReader.processStatJob(WebTableReader.java:218)
> >        at org.apache.nutch.crawl.WebTableReader.run(WebTableReader.java:479)
> >        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> >        at org.apache.nutch.crawl.WebTableReader.main(WebTableReader.java:412)
> > --> readdb -dump works.
>
> Confirmed and ticket opened as NUTCH-1391
>
> > % ./bin/nutch fetch 1339621550-203073321 -threads 1 -parse
> > Exception in thread "main" java.lang.IllegalArgumentException: arg -parse not recognized
>
> The -parse argument was removed in Nutch 2.0 and now throws an
> IllegalArgumentException. This is now expected behaviour. To enable
> parsing during fetching, please set the config in nutch-site.xml. The
> reason the incorrect -parse argument is still in the Usage message is
> that I was not diligent enough when patching the fetcher CLI aesthetics.
> I'll address this within the issue below as well.
>
> >
> >
> > % ./bin/nutch parse -all -force -resume
> > ParserJob: starting
> > ParserJob: resuming:    false           <<< -resume and
> > ParserJob: forced reparse:      false   <<< -force obviously ignored ?
> > ParserJob: parsing all
>
> Yes confirmed and ticket opened as NUTCH-1392
>
>
> > % ./bin/nutch generate
> > --> generates batchid, but should show help as in 1.x ?
> > --> is there an option -topN ?
>
> Yes this is opened in NUTCH-1393. Users may not necessarily wish to
> generate at all, instead wishing to merely find out the GeneratorJob
> CLI options... I will open this just now and fix for 2.1.
>
> > The 2.0 Solr schema and mappings still contain the field "site"
> > which has been removed in 1.x (NUTCH-1232).
> > Should be done also in 2.0: it's easier to maintain only one Solr
> > installation for all Nutch versions.
>
> Logged in NUTCH-1394
>
> Thanks Seb for your contributions here... this is exactly what we are
> after.
>
> Does anyone have issues with running another RC and addressing these
> issues in 2.1?
>
> --
> Lewis
>
