nutch-dev mailing list archives

From "Jamal, Sarfaraz" <Sarfaraz.Ja...@VerizonWireless.com>
Subject Newbie Questions
Date Wed, 08 Jun 2016 15:07:30 GMT
Hi Everyone,

I am having difficulty finding instructions I can follow to do the following. I feel
I am overlooking something simple.

1. Crawl a website.

2. Index it in Solr.

For now, I am stuck on #1.

I am running the following command:

*         bin/crawl -i urls/ TestCrawl/  2


In my urls/seed.txt file I have the following entry:

*         http://nutch.apache.org/

In my regex-urlfilter.txt I have the following entry:

*         +^http://([a-z0-9]*\.)*nutch.apache.org/
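To rule out the filter as the cause, the rule can be checked against the seed URL outside Nutch. The following is a minimal sketch in Python, not Nutch's own filter code. `is_accepted` is a hypothetical helper, and it assumes that an accept rule is the regular expression after the leading `+`, applied with a find-style match anywhere in the URL.

```python
import re

# Rule copied from the regex-urlfilter.txt entry above.
# Assumption: the leading "+" marks an accept rule and the rest is the regex.
RULE = r"+^http://([a-z0-9]*\.)*nutch.apache.org/"


def is_accepted(url, rule=RULE):
    """Return True if the accept rule matches the URL.

    Hypothetical helper for illustration only: it evaluates a single
    "+" rule and does not reproduce the full rule file semantics.
    """
    if not rule.startswith("+"):
        raise ValueError("this sketch only handles accept (+) rules")
    pattern = rule[1:]
    # Find-style match; the "^" in the pattern anchors it to the start.
    return re.search(pattern, url) is not None


if __name__ == "__main__":
    candidates = (
        "http://nutch.apache.org/",
        "http://wiki.nutch.apache.org/docs",
        "https://nutch.apache.org/",
        "http://example.com/",
    )
    for url in candidates:
        print(url, is_accepted(url))
```

Under these assumptions the seed URL passes the rule, while an `https://` URL or a different host does not. The dots in `nutch.apache.org` are unescaped in the rule, so they match any character, which makes the rule slightly more permissive than it looks.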

I am running the crawl command from Cygwin (I am on Windows).

I have set my JAVA_HOME environment variable and related settings.

The error I am getting is:

$ bin/crawl -i urls/ TestCrawl/  2
Injecting seed URLs
/cygdrive/c/apache-nutch-1.11/bin/nutch inject TestCrawl//crawldb urls/
Injector: starting at 2016-06-08 11:04:45
Injector: crawlDb: TestCrawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: java.lang.NullPointerException
        at java.lang.ProcessBuilder.start(Unknown Source)
        at org.apache.hadoop.util.Shell.runCommand(Shell.java:445)
        at org.apache.hadoop.util.Shell.run(Shell.java:418)
        at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:650)
        at org.apache.hadoop.util.Shell.execCommand(Shell.java:739)
        at org.apache.hadoop.util.Shell.execCommand(Shell.java:722)
        at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:633)
        at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:421)
        at org.apache.hadoop.fs.FilterFileSystem.mkdirs(FilterFileSystem.java:281)
        at org.apache.hadoop.mapreduce.JobSubmissionFiles.getStagingDir(JobSubmissionFiles.java:125)
        at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:348)
        at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1285)
        at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1282)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Unknown Source)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
        at org.apache.hadoop.mapreduce.Job.submit(Job.java:1282)
        at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:562)
        at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:557)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Unknown Source)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
        at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:557)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:548)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:833)
        at org.apache.nutch.crawl.Injector.inject(Injector.java:323)
        at org.apache.nutch.crawl.Injector.run(Injector.java:379)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
        at org.apache.nutch.crawl.Injector.main(Injector.java:369)

