nutch-dev mailing list archives

From "Jamal, Sarfaraz" <>
Subject Newbie Questions
Date Wed, 08 Jun 2016 15:07:30 GMT
Hi Everyone,

I am having difficulty finding instructions I can follow to do the two things below. I feel
I am missing something simple that I overlooked.

1.       Crawl a Website.

2.       Index it on SOLR.

For now, I am just stuck on #1 -

I am running the following command:

*         bin/crawl -i urls/ TestCrawl/  2

In my urls/seed.txt file I have the following entry:


In my regex-urlfilter.txt I have the following entry:

*         +^http://([a-z0-9]*\.)*
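To make clear what I expect this entry to do, here is a minimal Python sketch of first-match URL filtering. It assumes (as I understand Nutch's regex URL filter) that a leading `+` accepts, a leading `-` rejects, rules are tried in order, and a URL that matches no rule is dropped. The rule text is copied exactly as shown above; the function names and the example.com URLs are hypothetical, and Python's `re` module stands in for Java's regex engine.

```python
import re

# Rule text copied from the entry above; the leading "+" means "accept".
RULE = r"+^http://([a-z0-9]*\.)*"


def parse_rule(line):
    """Split a filter line into (accept, compiled_pattern)."""
    if not line or line[0] not in ("+", "-"):
        raise ValueError("rule must start with '+' or '-': %r" % line)
    return line[0] == "+", re.compile(line[1:])


def is_accepted(rules, url):
    """Apply rules in order; the first pattern found in the URL decides."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    # Assumption: a URL that matches no rule is filtered out.
    return False


if __name__ == "__main__":
    rules = [parse_rule(RULE)]
    # Hypothetical URLs, used only for illustration.
    for url in ("http://www.example.com/", "https://www.example.com/"):
        print(url, is_accepted(rules, url))
```

Under these assumptions the rule accepts any `http://` URL and rejects `https://` URLs, because the pattern requires `://` immediately after `http`.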

I am running the command above from Cygwin (I am on Windows).

I have set my JAVA_HOME environment variable and such.
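For reference, this is roughly the kind of check I mean when I say the variable is set: a minimal Python sketch, assuming the JDK/JRE keeps its launcher under a `bin` directory named `java` or `java.exe`. The function name `check_java_home` is hypothetical, and this only checks the environment; it says nothing about whether Nutch or Hadoop is otherwise configured correctly.

```python
import os


def check_java_home(env):
    """Return None if JAVA_HOME looks usable, else a short problem description."""
    java_home = env.get("JAVA_HOME")
    if not java_home:
        return "JAVA_HOME is not set"
    bin_dir = os.path.join(java_home, "bin")
    # Assumption: the launcher is named "java" (Unix, Cygwin) or "java.exe" (Windows).
    candidates = [os.path.join(bin_dir, name) for name in ("java", "java.exe")]
    if not any(os.path.isfile(path) for path in candidates):
        return "no java executable under " + bin_dir
    return None


if __name__ == "__main__":
    problem = check_java_home(os.environ)
    print(problem or "JAVA_HOME looks usable")
```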

The error I am getting is:

$ bin/crawl -i urls/ TestCrawl/  2
Injecting seed URLs
/cygdrive/c/apache-nutch-1.11/bin/nutch inject TestCrawl//crawldb urls/
Injector: starting at 2016-06-08 11:04:45
Injector: crawlDb: TestCrawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: java.lang.NullPointerException
        at java.lang.ProcessBuilder.start(Unknown Source)
        at org.apache.hadoop.util.Shell.runCommand(
        at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(
        at org.apache.hadoop.util.Shell.execCommand(
        at org.apache.hadoop.util.Shell.execCommand(
        at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(
        at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(
        at org.apache.hadoop.fs.FilterFileSystem.mkdirs(
        at org.apache.hadoop.mapreduce.JobSubmissionFiles.getStagingDir(
        at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(
        at org.apache.hadoop.mapreduce.Job$
        at org.apache.hadoop.mapreduce.Job$
        at Method)
        at Source)
        at org.apache.hadoop.mapreduce.Job.submit(
        at org.apache.hadoop.mapred.JobClient$
        at org.apache.hadoop.mapred.JobClient$
        at Method)
        at Source)
        at org.apache.hadoop.mapred.JobClient.submitJobInternal(
        at org.apache.hadoop.mapred.JobClient.submitJob(
        at org.apache.hadoop.mapred.JobClient.runJob(
        at org.apache.nutch.crawl.Injector.inject(
        at org.apache.nutch.crawl.Injector.main(
