nutch-dev mailing list archives

From Binoy d <binoy...@gmail.com>
Subject Re: Nutch2.x Null Pointer Exception in IndexerJob.Java for a fresh crawl with One Seed.
Date Mon, 01 Apr 2013 05:34:21 GMT
Hi Kiran,

I was running the org.apache.nutch.crawl.Crawler class from within Eclipse
(Run As configuration) with the usual arguments: urls -dir
/home/binoy/lab/dmoz/apache-url -solr http://localhost:8983/solr/ -depth
1 -topN 1
Thanks for the tip on remote debugging. It seems the latest 2.x revision is
broken: I just updated to HEAD and am now seeing a completely different
exception. Let me revert the workspace and look at it again; I was able to
reproduce the issue consistently before I ran svn update.

Regards,
Binoy



On Sun, Mar 31, 2013 at 8:48 PM, kiran chitturi
<chitturikiran15@gmail.com> wrote:

> Hi Binoy,
>
> Thanks for reporting the issue and debugging it.
>
> Did you try using individual commands or the crawl script instead of the
> crawl command?
>
> You can try running Nutch remotely [1]. This lets you run commands from
> the shell and debug them in Eclipse.
>
> [1]
> http://wiki.apache.org/nutch/RunNutchInEclipse#Remote_Debugging_in_Eclipse
>
>
> On Sun, Mar 31, 2013 at 11:25 PM, Binoy d <binoyd13@gmail.com> wrote:
>
>> Hi,
>>
>> I have Nutch 2.x set up with MySQL and am seeing a peculiar null pointer
>> exception on a crawl with sample seeds from DMOZ. I decided to do a fresh
>> crawl with only one URL as seed and an empty webpage table.
>> I am running *org.apache.nutch.crawl.Crawler* from Eclipse with args *urls
>> -dir /home/binoy/lab/dmoz/apache-url -solr http://localhost:8983/solr/
>> -depth 1 -topN 1*
>>
>> The apache-url seed file has only one entry ("http://nutch.apache.org/").
>>
>>
>> I see the following null pointer exception (logs):
>> http://pastebin.com/CaqJpPkn
>>
>> With a little debugging in Eclipse, I see
>>
>>         conf.set(GeneratorJob.BATCH_ID, batchId);
>>
>> in the createIndexJob method of IndexerJob.java being the root cause.
>>
>> Wrapping it in *if (batchId != null)* seems to solve the issue.
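
A minimal, self-contained sketch of the guard described above. It does not use Nutch or Hadoop: java.util.Properties stands in for the job configuration, on the assumption that the configuration's set method rejects a null value in the same way. The class name BatchIdGuard, the method setBatchId, and the key "example.batch.id" are hypothetical and only illustrate the pattern; the real key is whatever GeneratorJob.BATCH_ID holds.

```java
import java.util.Properties;

public class BatchIdGuard {
    // Hypothetical key, standing in for GeneratorJob.BATCH_ID.
    public static final String BATCH_ID_KEY = "example.batch.id";

    /**
     * Stores batchId under BATCH_ID_KEY only when it is non-null.
     * Returns true if the value was stored, false if it was skipped.
     */
    public static boolean setBatchId(Properties conf, String batchId) {
        if (batchId != null) {
            conf.setProperty(BATCH_ID_KEY, batchId);
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        Properties conf = new Properties();

        // Unguarded call: Properties does not accept null values.
        try {
            conf.setProperty(BATCH_ID_KEY, null);
        } catch (NullPointerException e) {
            System.out.println("unguarded set with null batchId threw NullPointerException");
        }

        // Guarded calls: a null batchId is skipped, a real one is stored.
        System.out.println("null batchId stored: " + setBatchId(conf, null));
        System.out.println("real batchId stored: " + setBatchId(conf, "batch-1"));
    }
}
```

The guard only avoids the exception. Whether skipping the setting is the right behaviour for the indexer when no batchId is supplied is the open question raised in the rest of this message.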
>>
>> I wanted to know if this is a valid patch. From grepping, it seems no one
>> else is reading GeneratorJob.BATCH_ID except IndexerJob.
>>
>> I always see batchId passed as null to createIndexJob for clean
>> crawls (empty table). Which scenario causes it to be non-null? And what is
>> the significance of the generator job's batchId for the indexing job?
>>
>> It seems a trivial issue, so I did not create a JIRA issue. I have
>> attached the small patch and would be glad if someone could take a look.
>>
>> Regards,
>> Binoy
>>
>>
>>
>
>
> --
> Kiran Chitturi
>
> <http://www.linkedin.com/in/kiranchitturi>
>
>
>
