nutch-dev mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Nutch Wiki] Update of "NutchTutorial" by riverma
Date Wed, 03 Sep 2014 23:41:02 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The "NutchTutorial" page has been changed by riverma:
https://wiki.apache.org/nutch/NutchTutorial?action=diff&rev1=67&rev2=68

Comment:
Reorganized and fixed confusing text within section 3: crawl your first website

  export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")
  }}}
  == 3. Crawl your first website ==
+ Nutch requires two configuration changes before a website can be crawled:
+ 
+  1. Customize your crawl properties, where at a minimum you provide a name for your crawler for external servers to recognize
+  1. Set a seed list of URLs to crawl
+ 
+ === 3.1 Customize your crawl properties ===
+  * Default crawl properties can be viewed and edited within `conf/nutch-default.xml` - most of these can be used without modification
+  * The file `conf/nutch-site.xml` serves as a place to add your own custom crawl properties that override those in `conf/nutch-default.xml`. The only required modification for this file is to override the `value` field of the `http.agent.name` property
-  * Add your agent name in the `value` field of the `http.agent.name` property in `conf/nutch-site.xml`, for example:
+   . i.e. add your agent name in the `value` field of the `http.agent.name` property in `conf/nutch-site.xml`, for example:
  
  {{{
  <property>
@@ -83, +91 @@

   <value>My Nutch Spider</value>
  </property>
  }}}
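  For reference, a complete minimal `conf/nutch-site.xml` might look like the sketch below. This assumes the standard `<configuration>` wrapper used by `conf/nutch-default.xml`; the agent name is a placeholder you should replace with your own:
  
  {{{
  <?xml version="1.0"?>
  <configuration>
   <!-- overrides the http.agent.name property from conf/nutch-default.xml -->
   <property>
    <name>http.agent.name</name>
    <value>My Nutch Spider</value>
   </property>
  </configuration>
  }}}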
+ === 3.2 Create a URL seed list ===
+  * A URL seed list contains a list of websites, one per line, which Nutch will crawl
+  * The file `conf/regex-urlfilter.txt` provides regular expressions that allow Nutch to filter and narrow down the types of web resources to crawl and download
+ 
+ ==== Create a URL seed list ====
   * `mkdir -p urls`
   * `cd urls`
   * `touch seed.txt` to create a text file `seed.txt` under `urls/` with the following content (one URL per line for each site you want Nutch to crawl).
@@ -90, +103 @@

  {{{
  http://nutch.apache.org/
  }}}
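  Putting these steps together, a quick shell sketch (assuming you are in the Nutch runtime directory and only want to seed the Nutch homepage) could be:
  
  {{{
  # create the seed directory and a one-line seed file
  mkdir -p urls
  echo "http://nutch.apache.org/" > urls/seed.txt
  }}}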
+ ==== (Optional) Configure Regular Expression Filters ====
-  * Edit the file `conf/regex-urlfilter.txt` and replace
+ Edit the file `conf/regex-urlfilter.txt` and replace
  
  {{{
  # accept anything else
@@ -103, +117 @@

  }}}
  This will include any URL in the domain `nutch.apache.org`.
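  For instance, the replacement line usually looks something like the following - an illustration only, so adjust the expression to match the domain you actually want to crawl:
  
  {{{
  # accept URLs within nutch.apache.org and its subdomains
  +^http://([a-z0-9]*\.)*nutch.apache.org/
  }}}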
  
+ NOTE: If you do not specify any domains to include within `regex-urlfilter.txt`, all domains that your seed URLs link to will be crawled as well.
+ 
- === 3.1 Using the Crawl Command ===
+ === 3.3 Using the Crawl Command ===
  {{{#!wiki caution
- The crawl command is deprecated. Please see section [[#A3.3._Using_the_crawl_script|3.3]] on how to use the crawl script that is intended to replace the crawl command.
+ The crawl command is deprecated. Please see section [[#A3.5._Using_the_crawl_script|3.5]] on how to use the crawl script that is intended to replace the crawl command.
  }}}
  Now we are ready to initiate a crawl. Use the following parameters:
  
@@ -134, +150 @@

  
  Typically one starts testing one's configuration by crawling at shallow depths, sharply limiting the number of pages fetched at each level (`-topN`), and watching the output to check that desired pages are fetched and undesirable pages are not. Once one is confident of the configuration, an appropriate depth for a full crawl is around 10. The number of pages per level (`-topN`) for a full crawl can range from tens of thousands to millions, depending on your resources.
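  As a rough illustration of such a test run (the flag values here are examples only, not recommendations), a shallow first crawl might be invoked as:
  
  {{{
  # shallow test crawl: depth 3, at most 5 pages fetched per level
  bin/nutch crawl urls -dir crawl -depth 3 -topN 5
  }}}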
  
- === 3.2 Using Individual Commands for Whole-Web Crawling ===
+ === 3.4 Using Individual Commands for Whole-Web Crawling ===
  '''NOTE''': If you previously modified the file `conf/regex-urlfilter.txt` as covered [[#A3._Crawl_your_first_website|here]], you will need to change it back.
  
  Whole-Web crawling is designed to handle very large crawls which may take weeks to complete, running on multiple machines. This also permits more control over the crawl process, and incremental crawling. It is important to note that whole-Web crawling does not necessarily mean crawling the entire World Wide Web. We can limit a whole-Web crawl to just a list of the URLs we want to crawl. This is done by using a filter just like the one we used when we ran the `crawl` command (above).
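  The individual steps are detailed below; as a brief sketch, one round of such a crawl typically chains the tools roughly as follows (the segment variable and `-topN` value are illustrative):
  
  {{{
  # seed the crawldb, generate a fetch list, then fetch, parse and update
  bin/nutch inject crawl/crawldb urls
  bin/nutch generate crawl/crawldb crawl/segments -topN 1000
  s1=`ls -d crawl/segments/2* | tail -1`
  bin/nutch fetch $s1
  bin/nutch parse $s1
  bin/nutch updatedb crawl/crawldb $s1
  bin/nutch invertlinks crawl/linkdb -dir crawl/segments
  }}}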
@@ -268, +284 @@

       Usage: bin/nutch solrclean <crawldb> <solrurl>
      Example: bin/nutch solrclean crawl/crawldb/ http://localhost:8983/solr
  }}}
- === 3.3. Using the crawl script ===
+ === 3.5. Using the crawl script ===
  If you have followed section 3.4 above on how the crawling can be done step by step, you might be wondering how a bash script can be written to automate the whole process described above.
  
  Nutch developers have written one for you :), and it is available at [[bin/crawl]].
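  The script's exact arguments depend on your Nutch version; around this release it is typically invoked with a seed directory, a crawl directory, a Solr URL and a number of rounds, roughly like this (values are illustrative):
  
  {{{
  # bin/crawl <seedDir> <crawlDir> <solrURL> <numberOfRounds>  (assumed argument order)
  bin/crawl urls/ crawl/ http://localhost:8983/solr/ 2
  }}}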
