nutch-dev mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Nutch Wiki] Update of "NutchTutorial" by WayneBurke
Date Fri, 10 Oct 2014 21:20:57 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The "NutchTutorial" page has been changed by WayneBurke:
https://wiki.apache.org/nutch/NutchTutorial?action=diff&rev1=72&rev2=73

Comment:
Updated 6. Integrate Solr with Nutch to reflect changes in the expected schema.xml and its
new location in the Solr example directory.

  == 6. Integrate Solr with Nutch ==
  We have both Nutch and Solr installed and set up correctly, and Nutch has already created crawl data from the seed URL(s). The steps below delegate searching to Solr so that the crawled links become searchable:
  
+  * Back up the original Solr example schema.xml:<<BR>>
+  {{{
-  * mv ${APACHE_SOLR_HOME}/example/solr/conf/schema.xml ${APACHE_SOLR_HOME}/example/solr/conf/schema.xml.org
+ mv ${APACHE_SOLR_HOME}/example/solr/collection1/conf/schema.xml ${APACHE_SOLR_HOME}/example/solr/collection1/conf/schema.xml.org
+ }}}
+ 
+  * Copy the Nutch specific schema.xml to replace it:
+  {{{
-  * `cp ${NUTCH_RUNTIME_HOME}/conf/schema.xml ${APACHE_SOLR_HOME}/example/solr/conf/`
+ cp ${NUTCH_RUNTIME_HOME}/conf/schema.xml ${APACHE_SOLR_HOME}/example/solr/collection1/conf/
+ }}}
+ 
+  * Open the Nutch schema.xml file for editing:<<BR>>
+  {{{
-  * vi ${APACHE_SOLR_HOME}/example/solr/conf/schema.xml
+ vi ${APACHE_SOLR_HOME}/example/solr/collection1/conf/schema.xml
+ }}}
-  * Copy exactly in 351 line: <field name="_version_" type="long" indexed="true" stored="true"/>
-  * restart Solr with the command “`java -jar start.jar`” under `${APACHE_SOLR_HOME}/example`
-  * run the Solr Index command:
  
+  * Comment out the following lines (53-54) in the file by changing this:
- {{{
+  {{{
+ <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
+ }}}
+  to this
+  {{{
+ <!--   <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/> -->
+ }}}
+ 
+  * Add the following line right after the line <field name="id" ... /> (probably at lines 69-70):
+  {{{
+ <field name="_version_" type="long" indexed="true" stored="true"/>
+ }}}
+ 
+  * If you want to see the raw HTML indexed by Solr, change the content field definition (line 80) to:
+  {{{
+ <field name="content" type="text" stored="true" indexed="true"/>
+ }}}
+  * Save the file and restart Solr under `${APACHE_SOLR_HOME}/example`:
+  {{{
+ java -jar start.jar
+ }}}
+ 
+  * Run the Solr Index command from ${NUTCH_RUNTIME_HOME}:<<BR>>
+  {{{
- bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/*
+ bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/
  }}}
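The file-handling steps above (back up, copy, add the `_version_` field) can be sketched as two small shell functions. This is only an illustration under stated assumptions: the function names `install_nutch_schema` and `add_version_field` and their argument order are hypothetical, and the edit assumes the `<field name="id" ... />` element starts on a single line. The line-number hints in the steps are not relied on.

```shell
#!/bin/sh
# Sketch only: function names and argument order are assumptions,
# not part of the tutorial.

# Back up Solr's example schema.xml (once) and replace it with the Nutch one.
#   $1 = Solr home (the tutorial's ${APACHE_SOLR_HOME})
#   $2 = Nutch runtime home (the tutorial's ${NUTCH_RUNTIME_HOME})
install_nutch_schema() {
    conf_dir="$1/example/solr/collection1/conf"
    nutch_schema="$2/conf/schema.xml"

    [ -f "$nutch_schema" ] || { echo "missing $nutch_schema" >&2; return 1; }
    [ -d "$conf_dir" ] || { echo "missing $conf_dir" >&2; return 1; }

    # Keep the first backup; do not overwrite it on a second run.
    if [ -f "$conf_dir/schema.xml" ] && [ ! -e "$conf_dir/schema.xml.org" ]; then
        mv "$conf_dir/schema.xml" "$conf_dir/schema.xml.org" || return 1
    fi
    cp "$nutch_schema" "$conf_dir/schema.xml"
}

# Insert the _version_ field after the first line containing <field name="id",
# unless the schema already declares it.
#   $1 = path to schema.xml
add_version_field() {
    schema="$1"
    if grep -q 'name="_version_"' "$schema"; then
        return 0
    fi
    awk '
        { print }
        !inserted && /<field name="id"/ {
            print "<field name=\"_version_\" type=\"long\" indexed=\"true\" stored=\"true\"/>"
            inserted = 1
        }
    ' "$schema" > "$schema.tmp" && mv "$schema.tmp" "$schema"
}
```

A possible invocation would be `install_nutch_schema "$APACHE_SOLR_HOME" "$NUTCH_RUNTIME_HOME"` followed by `add_version_field "$APACHE_SOLR_HOME/example/solr/collection1/conf/schema.xml"`. Commenting out the filter line and changing the content field are left as manual edits.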
+ 
- The call signature for running the solrindex has changed. The linkdb is now optional, so you need to denote it with a "-linkdb" flag on the command line.
+ * '' Note: If you are familiar with past versions of solrindex, be aware that its call signature has changed. The linkdb is now optional, so you must pass it with the "-linkdb" flag on the command line. ''
  
  This will send all crawl data to Solr for indexing. For more information please see [[bin/nutch solrindex]].
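To illustrate the optional linkdb, here is a minimal sketch of a helper that only prints the command line and adds `-linkdb` when a linkdb path is supplied. The function name `solrindex_cmd` and its argument order are hypothetical; the argument layout is copied from the command shown in the steps, not from the Nutch usage text.

```shell
# Sketch only: prints the solrindex command instead of running it.
#   $1 = Solr URL, $2 = crawldb, $3 = segments path, $4 = linkdb (optional)
solrindex_cmd() {
    if [ -n "${4:-}" ]; then
        # A linkdb was given, so it must be introduced by the -linkdb flag.
        echo "bin/nutch solrindex $1 $2 -linkdb $4 $3"
    else
        echo "bin/nutch solrindex $1 $2 $3"
    fi
}
```

Called as `solrindex_cmd http://127.0.0.1:8983/solr/ crawl/crawldb crawl/segments/ crawl/linkdb`, it prints the same command as in the step above.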
  
- If all has gone to plan, we are now ready to search with http://localhost:8983/solr/admin/. If you want to see the raw HTML indexed by Solr, change the content field definition in `schema.xml` to:
+ If all has gone to plan, you are now ready to search with http://localhost:8983/solr/admin/.
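Besides the admin page, the index can be checked from the command line. The sketch below only builds a select URL. It assumes the default `collection1` core used in the paths above and Solr's standard `select` handler with the `q`, `wt` and `rows` parameters. The function name `solr_select_url` is hypothetical, and the query is expected to be URL-safe already.

```shell
# Sketch only: build a Solr select URL for a quick search check.
#   $1 = Solr base URL, $2 = core name, $3 = query, $4 = rows (default 10)
solr_select_url() {
    base="${1%/}"   # drop a single trailing slash, if present
    echo "$base/$2/select?q=$3&wt=json&rows=${4:-10}"
}
```

With Solr running, something like `curl -s "$(solr_select_url http://localhost:8983/solr collection1 'content:nutch' 5)"` should return matching documents as JSON if the crawl data was indexed.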
  
- {{{
- <field name="content" type="text" stored="true" indexed="true"/>
- }}}
- 
