nutch-dev mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Nutch Wiki] Trivial Update of "NutchTutorial" by WayneBurke
Date Wed, 08 Oct 2014 20:22:21 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The "NutchTutorial" page has been changed by WayneBurke:
https://wiki.apache.org/nutch/NutchTutorial?action=diff&rev1=71&rev2=72

Comment:
typos corrected

   * `ant clean` will remove this directory (keep copies of modified config files)
  
  == 2. Verify your Nutch installation ==
-  * run "`bin/nutch`" - You can confirm a correct installation if you seeing similar to the following:
+  * run "`bin/nutch`" - You can confirm a correct installation if you see something similar to the following:
  
  {{{
  Usage: nutch COMMAND where command is one of:
@@ -154, +154 @@

  === 3.4 Using Individual Commands for Whole-Web Crawling ===
  '''NOTE''': If you previously modified the file `conf/regex-urlfilter.txt` as covered [[#A3._Crawl_your_first_website|here]] you will need to change it back.
  
- Whole-Web crawling is designed to handle very large crawls which may take weeks to complete, running on multiple machines.  This also permits more control over the crawl process, and incremental crawling.  It is important to note that whole Web crawling does not necessarily mean crawling the entire World Wide Web.  We can limit a whole Web crawl to just a list of the URLs we want to crawl.  This is done by using a filter just like we the one we used when we did the `crawl` command (above).
+ Whole-Web crawling is designed to handle very large crawls which may take weeks to complete, running on multiple machines.  This also permits more control over the crawl process, and incremental crawling.  It is important to note that whole Web crawling does not necessarily mean crawling the entire World Wide Web.  We can limit a whole Web crawl to just a list of the URLs we want to crawl.  This is done by using a filter just like the one we used when we did the `crawl` command (above).
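
The filter mentioned above lives in `conf/regex-urlfilter.txt`: the first `+` or `-` pattern that matches a URL decides whether it is fetched. A minimal sketch of that first-match idea, assuming the tutorial's `nutch.apache.org` domain (the pattern here is an illustration, not the file's stock contents):

```shell
# Sketch of how a regex-urlfilter.txt "+" rule gates a seed URL
# (domain and pattern are assumptions for illustration).
seed="http://nutch.apache.org/"
if echo "$seed" | grep -Eq '^https?://([a-z0-9.-]+\.)*nutch\.apache\.org/'; then
  echo "ACCEPT $seed"   # URL passes the filter and will be crawled
else
  echo "REJECT $seed"   # URL is filtered out of the crawl
fi
```

In the real file the same pattern would appear as a single line beginning with `+`, placed before the catch-all rule.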
  
  ==== Step-by-Step: Concepts ====
  Nutch data is composed of:
@@ -260, +260 @@

  ==== Step-by-Step: Indexing into Apache Solr ====
  Note: For this step you should have a Solr installation. If you didn't integrate Nutch with Solr, you should read [[#A4._Setup_Solr_for_search|here]].
  
- Now we are ready!!! To go on and index the all the resources. For more information see [[http://wiki.apache.org/nutch/bin/nutch%20solrindex|this paper]]
+ Now we are ready to go on and index all the resources. For more information see [[http://wiki.apache.org/nutch/bin/nutch%20solrindex|this paper]]
  
  {{{
       Usage: bin/nutch solrindex <solr url> <crawldb> [-linkdb <linkdb>] [-params k1=v1&k2=v2...] (<segment> ... | -dir <segments>) [-noCommit] [-deleteGone] [-filter] [-normalize]
  }}}
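
The synopsis above can be made concrete; a hypothetical invocation, assuming Solr is running at `http://localhost:8983/solr` and the `crawl/` directories were produced earlier in the tutorial (both are assumptions, not taken from the diff):

```shell
# Hypothetical example: index every segment under crawl/segments,
# using the linkdb for anchor text, filtering and normalizing URLs first.
bin/nutch solrindex http://localhost:8983/solr crawl/crawldb \
  -linkdb crawl/linkdb -dir crawl/segments -filter -normalize
```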
