nutch-dev mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Nutch Wiki] Update of "RunningNutchAndSolr" by amitkumar
Date Thu, 14 May 2009 17:16:34 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The following page has been changed by amitkumar:
http://wiki.apache.org/nutch/RunningNutchAndSolr

------------------------------------------------------------------------------
   * apt-get install sun-java6-jdk subversion ant patch unzip
  
  == Steps ==
+  Setup
-  1. Check out solr-trunk ( svn co http://svn.apache.org/repos/asf/lucene/solr/trunk/ solr-trunk )
-  1. Check out nutch-trunk ( svn co http://svn.apache.org/repos/asf/lucene/nutch/trunk/ nutch-trunk )
-  1. Go into the solr-trunk and run 'ant dist dist-solrj'
-  1. Copy apache-solr-solrj-1.3-dev.jar and apache-solr-common-1.3-dev.jar from solr-trunk/dist to nutch-trunk/lib
-  1. Apply patch from [http://www.foofactory.fi/files/nutch-solr/nutch_solr.patch FooFactory patch] to nutch-trunk (cd nutch-trunk; wget http://www.foofactory.fi/files/nutch-solr/nutch_solr.patch; patch -p0 < nutch_solr.patch)
-  1. Get zip file from [http://variogram.com/latest/SolrIndexer.zip Variogr.am] and unzip somewhere other than nutch-trunk
-  1. Copy ONLY SolrIndexer.java from src/java/org/apache/nutch/indexer/ to nutch-trunk/src/java/org/apache/nutch/indexer
-  1. Edit nutch-trunk/src/java/org/apache/nutch/indexer/SolrIndexer.java (somewhere around line 92):
-    * Replace int res = new SolrIndexer().doMain(NutchConfiguration.create(), args); with int res = ToolRunner.run(NutchConfiguration.create(), new SolrIndexer(), args);
-    * Edit the imports to pick up org.apache.hadoop.util.ToolRunner
-  1. Edit nutch-trunk/src/java/org/apache/nutch/indexer/Indexer.java changing the scope on LuceneDocumentWrapper from private to protected
-  1. Get the zip file from [http://blog.foofactory.fi/2007/02/online-indexing-integrating-nutch-with.html FooFactory] for SOLR-20
-  1. Unzip solr-client.zip somewhere, go into java/solrj and run 'ant'
-  1. Copy solr-client.jar from dist to nutch-trunk/lib
-  1. Copy xpp3-1.1.3.4.0.jar from lib to nutch-trunk/lib
-  1. Configure nutch-trunk/conf/nutch-site.xml with *at least* a value for the property indexer.solr.url (something like http://localhost:8983/solr/); you should also set http.agent.name, http.agent.description, http.agent.url, and http.agent.email (see the sketch after this list)
-  1. Edit nutch-trunk/conf/regex-urlfilter.xml to include some pattern for what to grab (such as +^http://([a-z0-9]*\.)apache.org/)
-  1. Configure some url(s) to crawl (make a nutch-trunk/urls directory containing a text file with just a url in it, like http://lucene.apache.org/nutch)
-  1. Copy the [http://www.foofactory.fi/files/nutch-solr/crawl.sh crawl.sh script] from FooFactory to nutch-trunk/bin (editing if needed for things like topN)
-  1. Go into solr-trunk and make an example server instance (run 'ant example')
-  1. Copy example off somewhere (like /tmp/mysolr)
-  1. Edit mysolr/solr/conf/schema.xml
-    * Add the fields that Nutch needs (url, content, segment, digest, host, site, anchor, title, tstamp, text--see [http://blog.foofactory.fi/2007/02/online-indexing-integrating-nutch-with.html FooFactory Article on Nutch + Solr])
-    * Change defaultSearchField to 'text'
-    * Change defaultOperator to 'AND'
-    * Add lines to "copyField" section to copy anchor, title, and content into the text field
-  1. Start the Solr you just made (cd /tmp/mysolr; java -jar start.jar)
-  1. Run a Nutch crawl using the bin/crawl.sh script.
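Referring to the nutch-site.xml configuration step above, a minimal file might look like the following sketch. The indexer.solr.url property comes from the step itself; the agent values here are made-up placeholders you should replace with your own:

 cat > nutch-trunk/conf/nutch-site.xml <<'EOF'
 <?xml version="1.0"?>
 <configuration>
   <!-- where the SolrIndexer posts documents -->
   <property>
     <name>indexer.solr.url</name>
     <value>http://localhost:8983/solr/</value>
   </property>
   <!-- identify your crawler; the values below are placeholders -->
   <property>
     <name>http.agent.name</name>
     <value>mycrawler</value>
   </property>
   <property>
     <name>http.agent.description</name>
     <value>Test crawler for Nutch + Solr integration</value>
   </property>
   <property>
     <name>http.agent.url</name>
     <value>http://example.com/crawler.html</value>
   </property>
   <property>
     <name>http.agent.email</name>
     <value>crawler@example.com</value>
   </property>
 </configuration>
 EOF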
  
- If you watch the output from your Solr instance (logs) you should see a bunch of messages scroll by when Nutch finishes crawling and posts new documents.  If not, then you've got something not configured right.  I'll try to add more notes here as people have questions/issues.
+ The first step is to download the required software components, namely Apache Solr and Nutch.
  
+ 1. Download Solr version 1.3.0 or LucidWorks for Solr from the download page
- '''Troubleshooting:'''
-  * If you get errors about "Type mismatch in value from map:" (expected ObjectWritable, but received NutchWritable), then you are likely missing the two steps I just added in step 9 above.  Sorry about that, I forgot about making the change there in SolrIndexer.
-  * Note: I was mistaken twice here.  I've re-written the order of the steps; it turns out you do need both the Variogram file and the FooFactory files.
-  * When in doubt, look at nutch-trunk/logs/hadoop.log.  It frequently shows details about what's gone wrong and can be a big help when you start getting "unexplained" errors.
-  * See the original articles at [http://blog.foofactory.fi/2007/02/online-indexing-integrating-nutch-with.html FooFactory Article on Nutch + Solr] and [http://variogram.com/latest/?p=26 Variogr.am Updates to FooFactory Posting]
- ---------------------------------------------------
- ERROR
- I did everything, but I got this error. Any ideas?
  
+ 2. Extract the Solr package
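Assuming the standard Solr 1.3.0 release artifact name (apache-solr-1.3.0.tgz), extraction would look something like:

 tar xzf apache-solr-1.3.0.tgz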
- 2008-04-03 15:42:28,009 WARN  mapred.LocalJobRunner - job_local_1
- java.io.IOException: Type mismatch in value from map: expected org.apache.hadoop.io.ObjectWritable, recieved org.apache.nutch.crawl.NutchWritable
-         at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:369)
-         at org.apache.nutch.indexer.Indexer.map(Indexer.java:344)
-         at org.apache.nutch.indexer.Indexer.map(Indexer.java:52)
-         at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
-         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:208)
-         at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:132)
- 2008-04-03 15:42:28,609 FATAL indexer.Indexer - SolrIndexer: java.io.IOException: Job failed!
-         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:894)
-         at org.apache.nutch.indexer.SolrIndexer.index(SolrIndexer.java:86)
-         at org.apache.nutch.indexer.SolrIndexer.run(SolrIndexer.java:111)
-         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
-         at org.apache.nutch.indexer.SolrIndexer.main(SolrIndexer.java:93)
- -------------------------------------------------------
- Sorry, but nothing changed. Same error as below.
  
+ 3. Download Nutch version 1.0 or later (alternatively, download the nightly version of Nutch, which contains the required functionality)
- ERROR
- I changed the lines and it worked, but this time it gave this error. I tried both private and protected scopes, but nothing changed.
- I also replaced the line  Document doc = (Document) ((ObjectWritable) value).get();  with  Document doc = (Document) ((NutchWritable) value).get(); , but that gave a build error.
  
- 2008-04-04 10:41:48,490 WARN  mapred.LocalJobRunner - job_local_1
+ 4. Extract the Nutch package
  
+ tar xzf apache-nutch-1.0.tar.gz
- java.lang.ClassCastException: org.apache.nutch.indexer.Indexer$LuceneDocumentWrapper
-         at org.apache.nutch.indexer.SolrIndexer$OutputFormat$1.write(SolrIndexer.java:135)
-         at org.apache.hadoop.mapred.ReduceTask$2.collect(ReduceTask.java:315)
-         at org.apache.nutch.indexer.Indexer.reduce(Indexer.java:275)
-         at org.apache.nutch.indexer.Indexer.reduce(Indexer.java:52)
-         at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:333)
-         at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:164)
- 2008-04-04 10:41:49,085 FATAL indexer.Indexer - SolrIndexer: java.io.IOException: Job failed!
-         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:894)
-         at org.apache.nutch.indexer.SolrIndexer.index(SolrIndexer.java:87)
-         at org.apache.nutch.indexer.SolrIndexer.run(SolrIndexer.java:112)
-         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
-         at org.apache.nutch.indexer.SolrIndexer.main(SolrIndexer.java:94)
- --------------------------------------------------------------------------
  
- It works like a charm, thanks for your help.
- (I had repeated a mistake in the nutch-trunk/src/java/org/apache/nutch/indexer/Indexer.java file.
- It is explained here: http://variogram.com/latest/?p=26 )
- +++ src/java/org/apache/nutch/indexer/Indexer.java      (working copy)
- -  private static class LuceneDocumentWrapper implements Writable {
- +  public static class LuceneDocumentWrapper implements Writable {
+ 5. Configure Solr
+ 
+ For the sake of simplicity we are going to use the example configuration of Solr as a base.
+ 
+ a. Copy the provided Nutch schema from directory apache-nutch-1.0/conf to directory apache-solr-1.3.0/example/solr/conf (overwrite the existing file)
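Concretely, assuming the directory layout above and that the provided schema file is conf/schema.xml, that copy is:

 cp apache-nutch-1.0/conf/schema.xml apache-solr-1.3.0/example/solr/conf/schema.xml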
+ 
+ We want to allow Solr to create the snippets for search results, so we need to store the content in addition to indexing it:
+ 
+ b. Change schema.xml so that the stored attribute of field "content" is true.
+ 
+ <field name="content" type="text" stored="true" indexed="true"/>
+ 
+ We want to be able to tweak the relevancy of queries easily, so we'll create a new dismax request handler configuration for our use case:
+ 
+ c. Open apache-solr-1.3.0/example/solr/conf/solrconfig.xml and paste the following fragment into it
+ 
+ <requestHandler name="/nutch" class="solr.SearchHandler" >
+   <lst name="defaults">
+     <str name="defType">dismax</str>
+     <str name="echoParams">explicit</str>
+     <float name="tie">0.01</float>
+     <str name="qf">
+       content^0.5 anchor^1.0 title^1.2
+     </str>
+     <str name="pf">
+       content^0.5 anchor^1.5 title^1.2 site^1.5
+     </str>
+     <str name="fl">
+       url
+     </str>
+     <str name="mm">
+       2&lt;-1 5&lt;-2 6&lt;90%
+     </str>
+     <int name="ps">100</int>
+     <bool name="hl">true</bool>
+     <str name="q.alt">*:*</str>
+     <str name="hl.fl">title url content</str>
+     <str name="f.title.hl.fragsize">0</str>
+     <str name="f.title.hl.alternateField">title</str>
+     <str name="f.url.hl.fragsize">0</str>
+     <str name="f.url.hl.alternateField">url</str>
+     <str name="f.content.hl.fragmenter">regex</str>
+   </lst>
+ </requestHandler>
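Once Solr is running (step 6) and some content has been indexed (step 12), this handler is reachable under /solr/nutch, matching the example query url at the end of this page; the query term here is just an example:

 curl "http://127.0.0.1:8983/solr/nutch/?q=lucene&rows=10"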
+ 
+ 6. Start Solr
+ 
+ cd apache-solr-1.3.0/example
+ java -jar start.jar
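To verify the server came up, the Solr example configuration exposes a ping handler you can hit (assuming the default port 8983):

 curl http://127.0.0.1:8983/solr/admin/ping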
+ 
+ 7. Configure Nutch
+ 
+ a. Open nutch-site.xml in directory apache-nutch-1.0/conf and replace its contents with the following (we specify our crawler name and active plugins, and limit the maximum url count for a single host per run to 100):
+ 
+ <?xml version="1.0"?>
+ <configuration>
+   <property>
+     <name>http.agent.name</name>
+     <value>nutch-solr-integration</value>
+   </property>
+   <property>
+     <name>generate.max.per.host</name>
+     <value>100</value>
+   </property>
+   <property>
+     <name>plugin.includes</name>
+     <value>protocol-http|urlfilter-regex|parse-html|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
+   </property>
+ </configuration>
+ 
+ b. Open regex-urlfilter.txt in directory apache-nutch-1.0/conf and replace its content with the following:
+ 
+ -^(https|telnet|file|ftp|mailto):
+  
+ # skip some suffixes
+ -\.(swf|SWF|doc|DOC|mp3|MP3|WMV|wmv|txt|TXT|rtf|RTF|avi|AVI|m3u|M3U|flv|FLV|WAV|wav|mp4|MP4|avi|AVI|rss|RSS|xml|XML|pdf|PDF|js|JS|gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
+  
+ # skip URLs containing certain characters as probable queries, etc.
+ -[?*!@=]
+  
+ # allow urls in the lucidimagination.com domain
+ +^http://([a-z0-9\-A-Z]*\.)*lucidimagination.com/
+  
+ # deny anything else
+ -.
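To sanity-check the filter, Nutch includes a URLFilterChecker utility that reads urls from stdin; bin/nutch can run a class by name, though the exact class options may vary between Nutch versions, so treat this as a sketch:

 echo "http://www.lucidimagination.com/about/" | bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined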
+ 
+ 8. Create a seed list (the initial urls to fetch)
+ 
+ mkdir urls
+ echo "http://www.lucidimagination.com/" > urls/seed.txt
+ 
+ 9. Inject the seed url(s) into the nutch crawldb (execute in the nutch directory)
+ 
+ bin/nutch inject crawl/crawldb urls
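One way to confirm the injection worked is to print crawldb statistics, which should show one db entry per injected url:

 bin/nutch readdb crawl/crawldb -stats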
+ 
+ 10. Generate fetch list, fetch and parse content
+ 
+ bin/nutch generate crawl/crawldb crawl/segments
+ 
+ The above command will generate a new segment directory under crawl/segments that at this point contains files storing the url(s) to be fetched. The following commands need the latest segment dir as a parameter, so we'll store it in an environment variable:
+ 
+ export SEGMENT=crawl/segments/`ls -tr crawl/segments|tail -1`
+ 
+ Now we launch the fetcher that actually goes and gets the content:
+ 
+ bin/nutch fetch $SEGMENT -noParsing
+ 
+ Next we parse the fetched content:
+ 
+ bin/nutch parse $SEGMENT
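To inspect a segment at this point, the readseg command can list its status (counts of generated, fetched, and parsed urls):

 bin/nutch readseg -list $SEGMENT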
+ 
+ Then we update the Nutch crawldb. The updatedb command will store all new urls discovered during the fetch and parse of the previous segment into the Nutch database so they can be fetched later. Nutch also stores information about the pages that were fetched, so the same urls won't be fetched again and again.
+ 
+ bin/nutch updatedb crawl/crawldb $SEGMENT -filter -normalize
+ 
+ Now a full fetch cycle is completed. Next you can repeat step 10 a couple more times to get some more content (see the loop sketch below).
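A minimal sketch of repeating that cycle, here three times, using only the commands from step 10:

 for i in 1 2 3; do
   # generate a fresh segment and pick up its directory name
   bin/nutch generate crawl/crawldb crawl/segments
   SEGMENT=crawl/segments/`ls -tr crawl/segments|tail -1`
   # fetch, parse, and fold discovered urls back into the crawldb
   bin/nutch fetch $SEGMENT -noParsing
   bin/nutch parse $SEGMENT
   bin/nutch updatedb crawl/crawldb $SEGMENT -filter -normalize
 done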
+ 
+ 11. Create linkdb
+ 
+ bin/nutch invertlinks crawl/linkdb -dir crawl/segments
+ 
+ 12. Finally, index all content from all segments into Solr
+ 
+ bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb crawl/linkdb crawl/segments/*
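Assuming the index request succeeded, you can check how many documents landed in Solr (the numFound value in the response) with a plain match-all query:

 curl "http://127.0.0.1:8983/solr/select?q=*:*&rows=0&wt=json"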
+ 
+ Now the indexed content is available through Solr. You can try executing searches from the Solr admin ui at
+ 
+ http://127.0.0.1:8983/solr/admin
+ 
+ or directly with a url like
+ 
+ http://127.0.0.1:8983/solr/nutch/?q=solr&version=2.2&start=0&rows=10&indent=on&wt=json
+ 
  
  
  
