Hi Everyone,

 

I am having difficulty finding instructions I can follow to do the following. I feel I am missing something simple I overlooked...

1. Crawl a website.

2. Index it into Solr.

For now, I am just stuck on #1.

 

I am running the following command:

    bin/crawl -i urls/ TestCrawl/ 2



In my urls/seed.txt file I have the following entry:

    http://nutch.apache.org/
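In case it matters, this is how the seed directory was set up (a sketch, assuming everything is run from the Nutch install root, e.g. /cygdrive/c/apache-nutch-1.11):

```shell
# Create the seed list under urls/ (run from the Nutch install root)
mkdir -p urls
echo 'http://nutch.apache.org/' > urls/seed.txt

# Confirm the file is readable and contains the seed URL
cat urls/seed.txt
```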

 

In my regex-urlfilter.txt I have the following entry:

    +^http://([a-z0-9]*\.)*nutch.apache.org/
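To rule out the filter as the culprit, I checked that the pattern actually accepts the seed URL. This uses grep's POSIX ERE rather than the Java regex engine Nutch uses, but the two agree for this simple pattern:

```shell
# Test the include pattern from regex-urlfilter.txt (without the leading '+')
# against the seed URL; grep prints the line if it matches.
echo 'http://nutch.apache.org/' | grep -E '^http://([a-z0-9]*\.)*nutch.apache.org/'
```

The URL is echoed back, so the seed should pass the filter.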

 

I am running this command from Cygwin (I am on Windows):

    bin/crawl -i urls/ TestCrawl/ 2

 

I have set my JAVA_HOME environment variable and the related settings.

 

The error I am getting is:

 

$ bin/crawl -i urls/ TestCrawl/  2
Injecting seed URLs
/cygdrive/c/apache-nutch-1.11/bin/nutch inject TestCrawl//crawldb urls/
Injector: starting at 2016-06-08 11:04:45
Injector: crawlDb: TestCrawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: java.lang.NullPointerException
        at java.lang.ProcessBuilder.start(Unknown Source)
        at org.apache.hadoop.util.Shell.runCommand(Shell.java:445)
        at org.apache.hadoop.util.Shell.run(Shell.java:418)
        at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:650)
        at org.apache.hadoop.util.Shell.execCommand(Shell.java:739)
        at org.apache.hadoop.util.Shell.execCommand(Shell.java:722)
        at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:633)
        at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:421)
        at org.apache.hadoop.fs.FilterFileSystem.mkdirs(FilterFileSystem.java:281)
        at org.apache.hadoop.mapreduce.JobSubmissionFiles.getStagingDir(JobSubmissionFiles.java:125)
        at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:348)
        at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1285)
        at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1282)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Unknown Source)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
        at org.apache.hadoop.mapreduce.Job.submit(Job.java:1282)
        at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:562)
        at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:557)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Unknown Source)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
        at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:557)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:548)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:833)
        at org.apache.nutch.crawl.Injector.inject(Injector.java:323)
        at org.apache.nutch.crawl.Injector.run(Injector.java:379)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
        at org.apache.nutch.crawl.Injector.main(Injector.java:369)