nutch-dev mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Nutch Wiki] Trivial Update of "NutchTutorial" by LewisJohnMcgibbney
Date Mon, 02 Mar 2015 18:02:20 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The "NutchTutorial" page has been changed by LewisJohnMcgibbney:
https://wiki.apache.org/nutch/NutchTutorial?action=diff&rev1=78&rev2=79

   * Java Runtime/Development Environment (1.7)
   * (Source build only) Apache Ant: http://ant.apache.org/
  
- == 1. Install Nutch ==
+ == Install Nutch ==
  === Option 1: Setup Nutch from a binary distribution ===
   * Download a binary package (`apache-nutch-1.X-bin.zip`) from [[http://www.apache.org/dyn/closer.cgi/nutch/|here]].
   * Unzip your binary Nutch package. There should be a folder `apache-nutch-1.X`.
@@ -43, +43 @@

   * config files should be modified in `apache-nutch-1.X/runtime/local/conf/`
   * `ant clean` will remove this directory (keep copies of modified config files)
  
- == 2. Verify your Nutch installation ==
+ == Verify your Nutch installation ==
   * Run `bin/nutch`. The installation is correct if you see output similar to the following:
  
  {{{
@@ -93, +93 @@

  
  Note that the `LMC-032857` above should be replaced with your machine name.
  
- == 3. Crawl your first website ==
+ == Crawl your first website ==
  Nutch requires two configuration changes before a website can be crawled:
  
   1. Customize your crawl properties; at a minimum, provide a name for your crawler so that external servers can recognize it
   1. Set a seed list of URLs to crawl
  
- === 3.1 Customize your crawl properties ===
+ === Customize your crawl properties ===
   * Default crawl properties can be viewed and edited in `conf/nutch-default.xml`; most of these can be used without modification
   * The file `conf/nutch-site.xml` is the place to add your own custom crawl properties, which override those in `conf/nutch-default.xml`. The only required modification to this file is to set the `value` field of the `http.agent.name` property, for example:
@@ -110, +110 @@

   <value>My Nutch Spider</value>
  </property>
  }}}
- === 3.2 Create a URL seed list ===
+ === Create a URL seed list ===
   * A URL seed list is a list of websites, one per line, which Nutch will crawl
   * The file `conf/regex-urlfilter.txt` provides regular expressions that let Nutch filter and narrow the types of web resources to crawl and download
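
A minimal sketch of creating such a seed list from the shell. The `urls/` directory, the `seed.txt` file name, and the seed URL itself are assumptions chosen for illustration; they are not shown in this excerpt of the page.

```shell
# Hypothetical layout: a urls/ directory holding a single seed file.
mkdir -p urls

# One URL per line; http://nutch.apache.org/ is only a placeholder seed.
printf '%s\n' 'http://nutch.apache.org/' > urls/seed.txt

# Show the resulting seed list.
cat urls/seed.txt
```

Additional sites can be appended to the same file with `>>`, one URL per line.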
  
@@ -272, +272 @@

       Usage: bin/nutch solrclean <crawldb> <solrurl>
       Example: /bin/nutch solrclean crawl/crawldb/ http://localhost:8983/solr
  }}}
- === 3.5. Using the crawl script ===
+ === Using the crawl script ===
- If you have followed the 3.2 section above on how the crawling can be done step by step,
you might be wondering how a bash script can be written to automate all the process described
above.
+ If you have followed the section above on how the crawling can be done step by step, you
might be wondering how a bash script can be written to automate all the process described
above.
  
  Nutch developers have written one for you :), and it is available at [[bin/crawl]].
  
@@ -283, +283 @@

  }}}
  The crawl script has a lot of parameters set, and you can modify them to suit your needs. It is best to understand the parameters before setting up big crawls.
  
- == 4. Setup Solr for search ==
+ == Setup Solr for search ==
   * download binary file from [[http://www.apache.org/dyn/closer.cgi/lucene/solr/|here]]
-  * unzip to `$HOME/apache-solr-3.X`, we will now refer to this as `${APACHE_SOLR_HOME}`
+  * unzip to `$HOME/apache-solr`, we will now refer to this as `${APACHE_SOLR_HOME}`
   * `cd ${APACHE_SOLR_HOME}/example`
   * `java -jar start.jar`
  
- == 5. Verify Solr installation ==
+ == Verify Solr installation ==
  After you have started Solr, you should be able to access the admin console at the following link:
  
  {{{
  http://localhost:8983/solr/#/
  }}}
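
The same check can be sketched from the command line, assuming `curl` is installed and Solr listens on the default port shown above. The `solr_is_up` helper and the `SOLR_URL` variable are hypothetical names introduced here for illustration.

```shell
# Assumption: the Solr example server listens on the default port 8983.
SOLR_URL="${SOLR_URL:-http://localhost:8983/solr/}"

# Return 0 if curl can connect and the server does not reply with an
# HTTP error status (4xx/5xx); return non-zero otherwise.
solr_is_up() {
  curl --silent --fail --max-time 5 --output /dev/null "$1"
}

if solr_is_up "$SOLR_URL"; then
  echo "Solr is reachable at $SOLR_URL"
else
  echo "Solr is not reachable at $SOLR_URL; check that start.jar is still running"
fi
```

This only confirms that the server answers HTTP requests; it does not verify that any core or schema is configured correctly.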
- == 6. Integrate Solr with Nutch ==
+ == Integrate Solr with Nutch ==
  We now have both Nutch and Solr installed and set up correctly, and Nutch has already created crawl data from the seed URL(s). Below are the steps to delegate searching to Solr so that the crawled links become searchable:
  
   * Back up the original Solr example schema.xml:<<BR>>
