nutch-dev mailing list archives

From "Markus Jelsma (JIRA)" <>
Subject [jira] [Commented] (NUTCH-1087) Deprecate crawl command and replace with example script
Date Tue, 10 Jul 2012 13:14:43 GMT


Markus Jelsma commented on NUTCH-1087:

Works nicely, but it cannot be run from the runtime/local directory, which is where the wiki
usually describes commands being run from.

{code}$ bin/crawl urls/ crawl/crawldb http://localhost:8983/solr 2
bin/crawl: line 89: ./nutch: No such file or directory{code}
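
One possible way to remove the dependency on the working directory, sketched under the assumption that the script calls its sibling nutch executable through a relative path (as the error above suggests). The helper name script_dir is hypothetical and is not taken from the attached patch:

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from NUTCH-1087.patch): print the absolute
# directory that contains the script path given as $1, so sibling
# executables can be found regardless of the current working directory.
script_dir() {
  # cd runs in a subshell, so the caller's working directory is unchanged.
  (cd "$(dirname "$1")" >/dev/null && pwd)
}

# Inside a crawl script, this could replace a relative "./nutch" call:
#   bin="$(script_dir "$0")"
#   "$bin/nutch" <command> <args>
```

With something like this in place, the script should behave the same whether it is started as bin/crawl from runtime/local or as ./crawl from within bin.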

When run from the bin directory instead, all goes well until invertlinks:

{code}LinkDb: starting at 2012-07-10 15:09:12
LinkDb: linkdb: ../crawl/crawldb/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: internal links will be ignored.
LinkDb: adding segment: 20120710150834
LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/markus/trunk/runtime/local/bin/20120710150834/parse_data
        at org.apache.hadoop.mapred.FileInputFormat.listStatus(
        at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(
        at org.apache.hadoop.mapred.FileInputFormat.getSplits(
        at org.apache.hadoop.mapred.JobClient.writeOldSplits(
        at org.apache.hadoop.mapred.JobClient.writeSplits(
        at org.apache.hadoop.mapred.JobClient.access$600(
        at org.apache.hadoop.mapred.JobClient$
        at org.apache.hadoop.mapred.JobClient$
        at Method)
        at org.apache.hadoop.mapred.JobClient.submitJobInternal(
        at org.apache.hadoop.mapred.JobClient.submitJob(
        at org.apache.hadoop.mapred.JobClient.runJob(
        at org.apache.nutch.crawl.LinkDb.invert(
        at org.apache.nutch.crawl.LinkDb.main({code}

I also think 2GB of heap space for child tasks is far too much for common installations.

> Deprecate crawl command and replace with example script
> -------------------------------------------------------
>                 Key: NUTCH-1087
>                 URL:
>             Project: Nutch
>          Issue Type: Task
>    Affects Versions: 1.4
>            Reporter: Markus Jelsma
>            Assignee: Julien Nioche
>            Priority: Minor
>             Fix For: 1.6
>         Attachments: NUTCH-1087.patch
> * remove the crawl command
> * add basic crawl shell script
> See thread:
