nutch-dev mailing list archives

From: Nic M <nicde...@gmail.com>
Subject: Re: IOException in dedup
Date: Tue, 02 Jun 2009 17:13:51 GMT

On Jun 2, 2009, at 12:41 PM, Ken Krugler wrote:

>> Hello,
>>
>> I am new to Nutch and I have set up Nutch 0.9 on Easy Eclipse for
>> Mac OS X. When I try to start crawling I get the following exception:
>>
>> Dedup: starting
>> Dedup: adding indexes in: crawl/indexes
>> Exception in thread "main" java.io.IOException: Job failed!
>>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
>>         at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:439)
>>         at org.apache.nutch.crawl.Crawl.main(Crawl.java:135)
>>
>>
>> Does anyone know how to solve this problem?
>
> You can get an IOException reported by Hadoop when the root cause is  
> that you've run out of memory. Normally the hadoop.log file would  
> have the OOM exception.
>
> If you're running from inside of Eclipse, see
> http://wiki.apache.org/nutch/RunNutchInEclipse0.9 for more details.
>
> -- Ken
> -- 
> Ken Krugler
> +1 530-210-6378

Thank you for the pointers, Ken. I changed the VM memory parameters as
shown at http://wiki.apache.org/nutch/RunNutchInEclipse0.9 (roughly the
kind of arguments sketched after the log below). However, I still get the
exception, and the Hadoop log shows the following:

2009-06-02 13:08:18,790 INFO  indexer.DeleteDuplicates - Dedup: starting
2009-06-02 13:08:18,817 INFO  indexer.DeleteDuplicates - Dedup: adding indexes in: crawl/indexes
2009-06-02 13:08:19,064 WARN  mapred.LocalJobRunner - job_7izmuc
java.lang.ArrayIndexOutOfBoundsException: -1
	at org.apache.lucene.index.MultiReader.isDeleted(MultiReader.java:113)
	at org.apache.nutch.indexer.DeleteDuplicates$InputFormat$DDRecordReader.next(DeleteDuplicates.java:176)
	at org.apache.hadoop.mapred.MapTask$1.next(MapTask.java:157)
	at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:46)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:175)
	at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:126)
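
For reference, the memory change amounts to heap arguments on the Eclipse
run configuration for org.apache.nutch.crawl.Crawl, something along these
lines (the sizes are only an example, not values prescribed by the wiki page):

	-Xms64m -Xmx512m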

I am running Lucene 2.1.0. Any idea why I am getting the
ArrayIndexOutOfBoundsException?
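
One thing I can try next is opening the part indexes under crawl/indexes
directly with Lucene, to see whether one of them is empty or unreadable,
since the stack trace shows the dedup record reader going through a
MultiReader over those indexes. A rough check along these lines (the class
name and loop are mine, not something from Nutch):

	// Rough sanity check of the dedup inputs using the Lucene 2.1 API:
	// print the document counts of each part index under crawl/indexes.
	import java.io.File;
	import org.apache.lucene.index.IndexReader;

	public class CheckIndexes {
	  public static void main(String[] args) throws Exception {
	    File indexesDir = new File("crawl/indexes");
	    for (File part : indexesDir.listFiles()) {
	      if (!part.isDirectory()) continue;
	      IndexReader reader = IndexReader.open(part);
	      System.out.println(part.getName() + ": numDocs=" + reader.numDocs()
	          + ", maxDoc=" + reader.maxDoc());
	      reader.close();
	    }
	  }
	}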

Nic



