nutch-dev mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Nutch Wiki] Update of "RunningNutchAndSolr" by amitkumar
Date Thu, 14 May 2009 17:28:58 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The following page has been changed by amitkumar:
http://wiki.apache.org/nutch/RunningNutchAndSolr

------------------------------------------------------------------------------
   * apt-get install sun-java6-jdk subversion ant patch unzip
  
  == Steps ==
-  Setup
  
  The first step to get started is to download the required software components, namely Apache Solr and Nutch.
  
- 1. Download Solr version 1.3.0 or LucidWorks for Solr from Download page
+ '''1.''' Download Solr version 1.3.0 or LucidWorks for Solr from Download page
  
- 2. Extract Solr package
+ '''2.''' Extract Solr package
  
- 3. Download Nutch version 1.0 or later (alternatively, download the nightly version of Nutch that contains the required functionality)
+ '''3.''' Download Nutch version 1.0 or later (alternatively, download the nightly version of Nutch that contains the required functionality)
  
- 4. Extract the Nutch package
+ '''4.''' Extract the Nutch package       tar xzf apache-nutch-1.0.tar.gz
  
- tar xzf apache-nutch-1.0.tar.gz
- 
- 5. Configure Solr
+ '''5.''' Configure Solr
- 
  For the sake of simplicity we are going to use the example
  configuration of Solr as a base.
  
- a. Copy the provided Nutch schema from directory
+ '''a.''' Copy the provided Nutch schema from directory
  apache-nutch-1.0/conf to directory apache-solr-1.3.0/example/solr/conf (overwrite the existing file)
  
  We want to allow Solr to create the snippets for search results, so we need to store the content in addition to indexing it:
  
- b. Change schema.xml so that the stored attribute of field “content” is true.
+ '''b.''' Change schema.xml so that the stored attribute of field “content” is true.
  
  <field name="content" type="text" stored="true" indexed="true"/>
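
The same change can be scripted instead of edited by hand. This is a minimal sketch, assuming the copied schema declares the content field on a single line with stored="false"; the function name enable_content_storage is hypothetical and not part of Nutch or Solr:

```shell
# Hypothetical helper: set stored="true" on the "content" field of a schema file.
# Assumes the field is declared on one line that starts with <field name="content".
enable_content_storage() {
  schema_file="$1"
  # Only lines declaring the "content" field are changed; a .bak copy is kept.
  sed -i.bak '/<field name="content"/ s/stored="false"/stored="true"/' "$schema_file"
}
```

It would be called as enable_content_storage apache-solr-1.3.0/example/solr/conf/schema.xml. If the field is already stored, the file is left unchanged.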
  
  We want to be able to tweak the relevancy of queries easily, so we’ll create a new dismax request handler configuration for our use case:
  
- d. Open apache-solr-1.3.0/example/solr/conf/solrconfig.xml and paste the following fragment into it
+ '''d.''' Open apache-solr-1.3.0/example/solr/conf/solrconfig.xml and paste the following fragment into it
  
  <requestHandler name="/nutch" class="solr.SearchHandler" >
  
@@ -93, +89 @@

  
  </requestHandler>
  
- 6. Start Solr
+ '''6.''' Start Solr
  
  cd apache-solr-1.3.0/example
  java -jar start.jar
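
Since java -jar start.jar keeps running in the foreground, the remaining steps are run from a second terminal. The sketch below polls Solr until it answers, assuming curl is installed and that the example configuration serves a ping handler at /solr/admin/ping; the function name wait_for_solr and its three arguments are hypothetical:

```shell
# Hypothetical helper: poll a URL until it answers successfully or attempts run out.
# Arguments (all optional): URL, number of attempts, delay in seconds between attempts.
wait_for_solr() {
  url="${1:-http://127.0.0.1:8983/solr/admin/ping}"  # assumed ping location
  attempts="${2:-30}"
  delay="${3:-2}"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # -f makes curl return non-zero on HTTP errors, -s silences progress output.
    if curl -sf -o /dev/null "$url"; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}
```

Running wait_for_solr && echo "Solr is up" before step 7 avoids indexing against a server that has not finished starting.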
  
- 7. Configure Nutch
+ '''7. Configure Nutch'''
  
  a. Open nutch-site.xml in directory apache-nutch-1.0/conf and replace its contents with the following (we specify our crawler name, the active plugins, and limit the maximum URL count for a single host per run to 100):
  
  <?xml version="1.0"?>
  <configuration>
+ 
  <property>
+ 
  <name>http.agent.name</name>
+ 
  <value>nutch-solr-integration</value>
+ 
  </property>
+ 
  <property>
  <name>generate.max.per.host</name>
+ 
  <value>100</value>
+ 
  </property>
+ 
  <property>
+ 
  <name>plugin.includes</name>
+ 
  <value>protocol-http|urlfilter-regex|parse-html|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
+ 
  </property>
+ 
  </configuration>
  
+ 
- b. Open regex-urlfilter.txt in directory apache-nutch-1.0/conf,
+ '''b.''' Open regex-urlfilter.txt in directory apache-nutch-1.0/conf, replace its content with the following:
- replace its content with the following:
  
  -^(https|telnet|file|ftp|mailto):
   
@@ -135, +143 @@

  # deny anything else
  -.
  
- 8. Create a seed list (the initial urls to fetch)
+ '''8.''' Create a seed list (the initial urls to fetch)
  
  mkdir urls
  echo "http://www.lucidimagination.com/" > urls/seed.txt
  
- 9. Inject the seed URL(s) into the Nutch crawldb (execute in the Nutch directory)
+ '''9.''' Inject the seed URL(s) into the Nutch crawldb (execute in the Nutch directory)
  
  bin/nutch inject crawl/crawldb urls
  
- 10. Generate fetch list, fetch and parse content
+ '''10.''' Generate fetch list, fetch and parse content
  
  bin/nutch generate crawl/crawldb crawl/segments
  
@@ -166, +174 @@

  
  A full fetch cycle is now complete. You can repeat step 10 a couple more times to fetch more content.
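
Repeating the cycle can be wrapped in a small loop. This is only a sketch: the diff elides most of step 10, so the fetch, parse, and updatedb calls below are assumed from the usual Nutch 1.0 command set rather than taken from this page, and the NUTCH variable and the function name run_fetch_cycles are hypothetical:

```shell
# NUTCH is a hypothetical override for the launcher path; run from the Nutch directory.
NUTCH="${NUTCH:-bin/nutch}"

# Run the generate/fetch/parse/updatedb cycle a given number of times.
run_fetch_cycles() {
  rounds="$1"
  n=0
  while [ "$n" -lt "$rounds" ]; do
    "$NUTCH" generate crawl/crawldb crawl/segments || return 1
    # Segment directories are named by timestamp, so the newest one sorts last.
    segment="crawl/segments/$(ls crawl/segments | sort | tail -n 1)"
    # Assumed commands (not shown in this diff): fetch, parse, then update the crawldb.
    "$NUTCH" fetch "$segment" -noParsing || return 1
    "$NUTCH" parse "$segment" || return 1
    "$NUTCH" updatedb crawl/crawldb "$segment" -filter -normalize || return 1
    n=$((n + 1))
  done
}
```

For example, run_fetch_cycles 3 would run three rounds and stop early if any command fails.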
  
- 11. Create linkdb
+ '''11.''' Create linkdb
  
  bin/nutch invertlinks crawl/linkdb -dir crawl/segments
  
- 12. Finally index all content from all segments to Solr
+ '''12.''' Finally index all content from all segments to Solr
  
  bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb crawl/linkdb crawl/segments/*
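
To confirm that documents arrived, the match count can be read from a Solr response. This is a minimal sketch, assuming the default XML response format with a numFound attribute and that curl is installed; the function name extract_num_found is hypothetical:

```shell
# Hypothetical helper: read a Solr XML response on stdin and print its numFound value.
extract_num_found() {
  sed -n 's/.*numFound="\([0-9][0-9]*\)".*/\1/p' | head -n 1
}

# Possible usage against the running example server (assumed select URL):
#   curl -s 'http://127.0.0.1:8983/solr/select?q=*:*&rows=0' | extract_num_found
```

A count of zero after step 12 suggests the segments were empty or the index call pointed at a different Solr instance.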
  
