lucene-solr-user mailing list archives

From Baruch Kogan
Subject Integrating Solr with Nutch
Date Sun, 01 Mar 2015 17:56:08 GMT
Hi, guys,

I'm working through the Nutch tutorial. I've run a crawl on a list of
webpages, and now I'm trying to index them into Solr. Solr is installed and
runs fine: it indexes .json, .xml, whatever, and returns queries. I've edited
the Nutch schema as per the instructions. Now I've hit a wall:


   Save the file and restart Solr under ${APACHE_SOLR_HOME}/example:

   java -jar start.jar

On my install (the latest Solr), there is no such file, but there is a file
in /bin which I can start. So I pasted it into solr/example/ and ran it from
there. Solr starts up. Now I need to:


   run the Solr Index command from ${NUTCH_RUNTIME_HOME}:

   bin/nutch solrindex crawl/crawldb -linkdb crawl/linkdb crawl/segments/

and I get this:

ubuntu@ubuntu-VirtualBox:~/crawler/nutch$ bin/nutch solrindex <> crawl/crawldb -linkdb crawl/linkdb crawl/segments/
Indexer: starting at 2015-03-01 19:51:09
Indexer: deleting gone documents: false
Indexer: URL filtering: false
Indexer: URL normalizing: false
Active IndexWriters :
    solr.server.url : URL of the SOLR instance (mandatory)
    solr.commit.size : buffer size when sending to SOLR (default 1000)
    solr.mapping.file : name of the mapping file for fields (default
    solr.auth : use authentication (default false)
    solr.auth.username : use authentication (default false)
    solr.auth : username for authentication
    solr.auth.password : password for authentication

Indexer: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/ubuntu/crawler/nutch/crawl/segments/crawl_fetch
Input path does not exist:
Input path does not exist:
Input path does not exist:
Input path does not exist:
Input path does not exist:
    at
    at
    at
    at org.apache.hadoop.mapred.JobClient.writeOldSplits(
    at org.apache.hadoop.mapred.JobClient.writeSplits(
    at org.apache.hadoop.mapred.JobClient.access$700(
    at org.apache.hadoop.mapred.JobClient$
    at org.apache.hadoop.mapred.JobClient$
    at Method)
    at
    at
    at
    at org.apache.hadoop.mapred.JobClient.submitJob(
    at org.apache.hadoop.mapred.JobClient.runJob(
    at org.apache.nutch.indexer.IndexingJob.index(
    at
    at
    at org.apache.nutch.indexer.IndexingJob.main(
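
The first missing path in the error is crawl_fetch directly under
crawl/segments/. As a rough diagnostic, the sketch below lists which expected
sub-directories are absent, first for crawl/segments itself and then for each
directory one level below it. It assumes the usual Nutch 1.x segment layout
(crawl_fetch, crawl_parse, parse_data, parse_text); the function name
missing_parts and the SEGMENTS_DIR variable are made up for illustration and
are not part of Nutch.

```shell
#!/bin/sh
# Sketch only: assumes the usual Nutch 1.x segment layout.
# missing_parts and SEGMENTS_DIR are hypothetical names, not Nutch features.

# Print every expected segment sub-directory that is absent under $1.
missing_parts() {
    for part in crawl_fetch crawl_parse parse_data parse_text; do
        [ -d "$1/$part" ] || printf '%s\n' "$1/$part"
    done
}

# Path as passed on the command line in this message.
SEGMENTS_DIR=crawl/segments

if [ -d "$SEGMENTS_DIR" ]; then
    # Check the directory itself, exactly as it was given to the indexer.
    missing_parts "$SEGMENTS_DIR"

    # Check each directory one level down as well.
    for seg in "$SEGMENTS_DIR"/*/; do
        [ -d "$seg" ] || continue
        missing_parts "${seg%/}"
    done
fi
```

Run from the Nutch runtime directory, it prints one line per missing path and
prints nothing for a directory that contains all four parts.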

What am I doing wrong?


Baruch Kogan
Marketing Manager
Seller Panda
baruch.kogan at Skype
