nutch-dev mailing list archives

From Nutch Newbie <nutch.new...@gmail.com>
Subject Re: Tried to run Crawl with depth of only 2 and getting IOException
Date Wed, 20 Jan 2010 19:40:13 GMT
On Wed, Jan 20, 2010 at 7:10 PM, kraman <kirthi.raman@gmail.com> wrote:
>
> kirthi10@cerebrum [~/www/nutch]# ./bin/nutch crawl url -dir tinycrawl -depth 2
> crawl started in: tinycrawl
> rootUrlDir = url
> threads = 10
> depth = 2
> Injector: starting
> Injector: crawlDb: tinycrawl/crawldb
> Injector: urlDir: url
> Injector: Converting injected urls to crawl db entries.
> Injector: Merging injected urls into crawl db.
> Injector: done
> Generator: Selecting best-scoring urls due for fetch.
> Generator: starting
> Generator: segment: tinycrawl/segments/20100120130316
> Generator: filtering: false
> Generator: topN: 2147483647
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: Partitioning selected urls by host, for politeness.
> Generator: done.
> Fetcher: starting
> Fetcher: segment: tinycrawl/segments/20100120130316
> Fetcher: threads: 10
> fetching http://www.mywebsite.us/
> fetch of http://www.mywebsite.us/ failed with: java.lang.RuntimeException: Agent name not configured!

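The fetch failed because no agent name is configured. You need to fix the
Nutch config file as per the README: set the http.agent.name property in
conf/nutch-site.xml. The Dedup IOException at the end of your log is most
likely a consequence of this, since nothing was fetched and the index is
empty.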
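
A minimal sketch of that fix, assuming the usual conf/nutch-site.xml
override file and the http.agent.name property. The value MyTestCrawler is
only a placeholder and does not come from this thread; pick a name that
identifies your own crawler.

```xml
<?xml version="1.0"?>
<!-- conf/nutch-site.xml: site-specific overrides of nutch-default.xml -->
<configuration>
  <property>
    <name>http.agent.name</name>
    <!-- Placeholder value (assumption); must not be left empty -->
    <value>MyTestCrawler</value>
  </property>
</configuration>
```

After changing the config, it is probably safest to delete the tinycrawl
directory and rerun the crawl, so the failed segments are not reused.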




> Fetcher: done
> CrawlDb update: starting
> CrawlDb update: db: tinycrawl/crawldb
> CrawlDb update: segments: [tinycrawl/segments/20100120130316]
> CrawlDb update: additions allowed: true
> CrawlDb update: URL normalizing: true
> CrawlDb update: URL filtering: true
> CrawlDb update: Merging segment data into db.
> CrawlDb update: done
> Generator: Selecting best-scoring urls due for fetch.
> Generator: starting
> Generator: segment: tinycrawl/segments/20100120130323
> Generator: filtering: false
> Generator: topN: 2147483647
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: Partitioning selected urls by host, for politeness.
> Generator: done.
> Fetcher: starting
> Fetcher: segment: tinycrawl/segments/20100120130323
> Fetcher: threads: 10
> fetching http://www.mywebsite.us/
> fetch of http://www.mywebsite.us/ failed with: java.lang.RuntimeException: Agent name not configured!
> Fetcher: done
> CrawlDb update: starting
> CrawlDb update: db: tinycrawl/crawldb
> CrawlDb update: segments: [tinycrawl/segments/20100120130323]
> CrawlDb update: additions allowed: true
> CrawlDb update: URL normalizing: true
> CrawlDb update: URL filtering: true
> CrawlDb update: Merging segment data into db.
> CrawlDb update: done
> LinkDb: starting
> LinkDb: linkdb: tinycrawl/linkdb
> LinkDb: URL normalize: true
> LinkDb: URL filter: true
> LinkDb: adding segment: tinycrawl/segments/20100120130323
> LinkDb: adding segment: tinycrawl/segments/20100120130316
> LinkDb: done
> Indexer: starting
> Indexer: linkdb: tinycrawl/linkdb
> Indexer: adding segment: tinycrawl/segments/20100120130323
> Indexer: adding segment: tinycrawl/segments/20100120130316
> Optimizing index.
> Indexer: done
> Dedup: starting
> Dedup: adding indexes in: tinycrawl/indexes
> Exception in thread "main" java.io.IOException: Job failed!
>        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
>        at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:439)
>        at org.apache.nutch.crawl.Crawl.main(Crawl.java:135)
>
> LogFile gives
> java.lang.ArrayIndexOutOfBoundsException: -1
>        at org.apache.lucene.index.MultiReader.isDeleted(MultiReader.java:113)
>        at org.apache.nutch.indexer.DeleteDuplicates$InputFormat$DDRecordReader.next(DeleteDuplicates.java:176)
>        at org.apache.hadoop.mapred.MapTask$1.next(MapTask.java:157)
>        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:46)
>        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:175)
>        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:126)
> --
> View this message in context: http://old.nabble.com/Tried-to-run-Crawl-with-depth-of-only-2-and-getting-IOException-tp27246959p27246959.html
> Sent from the Nutch - Dev mailing list archive at Nabble.com.
>
>
