nutch-dev mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Nutch Wiki] Update of "Tutorial on incremental crawling" by Gabriele Kahlout
Date Sun, 27 Mar 2011 12:35:20 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The "Tutorial on incremental crawling" page has been changed by Gabriele Kahlout.
http://wiki.apache.org/nutch/Tutorial%20on%20incremental%20crawling?action=diff&rev1=1&rev2=2

--------------------------------------------------

  
  If not ready, follow [[Tutorial]] to set up and configure Nutch on your machine.
  
- It also works with Solr. If you have Solr setup
+ Two scripts follow:
  
- {{{
+ 1. Abridged script using Solr;
+ 
+ 2. Unabridged script with explanations and using nutch index.
+ 
+ == 1. Abridged script using Solr ==
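+ 
+ The abridged variant is not spelled out in this revision. As a dry-run sketch only (the `crawl/` layout, the values of `it_size` and `depth`, and the Solr URL `http://localhost:8983/solr` are assumptions, and the `run` wrapper merely prints each command instead of executing it), each iteration of the Solr-based loop typically chains the standard Nutch 1.x commands:
+ 
```shell
#!/bin/sh
# Dry-run sketch of the abridged incremental loop with Solr.
# Assumptions (not from the wiki page): a crawl/ directory layout,
# a Solr instance at http://localhost:8983/solr, Nutch 1.x commands.
run () { echo "$@"; }   # dry run: print each command instead of executing it

it_size=10   # urls fetched per iteration (assumed value)
depth=3      # number of iterations (assumed value)

run bin/nutch inject crawl/crawldb urls
i=1
while [ "$i" -le "$depth" ]
do
    run bin/nutch generate crawl/crawldb crawl/segments -topN "$it_size"
    segment="crawl/segments/latest"   # placeholder; a real run picks the newest segment dir
    run bin/nutch fetch "$segment"
    run bin/nutch updatedb crawl/crawldb "$segment"
    run bin/nutch invertlinks crawl/linkdb -dir crawl/segments
    run bin/nutch solrindex http://localhost:8983/solr crawl/crawldb -linkdb crawl/linkdb "$segment"
    i=$((i + 1))
done
```
+ 
+ Replacing `run` with direct execution (and selecting the newest segment directory, e.g. the last entry of `crawl/segments/`) turns the sketch into a real loop; the unabridged script in the next section adds the explanations and merge steps.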
+ == 2. Unabridged script with explanations and using nutch index ==
- #!/bin/sh
+ {{{#!/bin/sh
  #
  # Created by Gabriele Kahlout on 27.03.11.
- # 
+ #
- # The following script crawls the whole-web incrementally; Specifying a list of urls to crawl, nutch will continuously fetch $it_size urls from a 
+ # The following script crawls the whole-web incrementally; specifying a list of urls to crawl, nutch will continuously fetch $it_size urls from a
  # specified list of urls, index and merge them with our whole-web index, so that they can be immediately searched, until all urls have been fetched.
  #
  # Usage: ./whole-web-crawling-incremental [it_seedsDir-path urls-to-fetch-per-iteration depth]
@@ -23, +28 @@

  # 2. $ cd $NUTCH_HOME
  # 3. $ chmod +x whole-web-crawling-incremental
  # 4. $ ./whole-web-crawling-incremental
- # 
+ #
  # Start
  function echoThenRun () { # echo and then run the command
    echo $1
@@ -69, +74 @@

      do
          echo
          echo "generate-fetch-updatedb-invertlinks-index-merge iteration "$i":"
+         echo
-         echoThenRun "bin/nutch generate $it_crawldb crawl/segments -topN $it_size"
+ 	cmd="bin/nutch generate $it_crawldb crawl/segments -topN $it_size"
-         output=`$cmd`
-         echo $output
+ 	echo $cmd
+ 	output=`$cmd`
+ 	echo $output
          if [[ $output == *'0 records selected for fetching'* ]] # all the urls of this iteration have been fetched
          then
              break;
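
One caveat with the termination test above: `[[ $output == *'...'* ]]` is a bash/ksh construct, while the script declares `#!/bin/sh`. A portable sketch of the same check (the sample log lines below are illustrative, not real Nutch output) can use `case` instead:

```shell
#!/bin/sh
# Portable equivalent of the bash-only
#   [[ $output == *'0 records selected for fetching'* ]]
# termination test: succeed (stop) when generate reports nothing left to fetch.
should_stop () {
    case "$1" in
        *'0 records selected for fetching'*) return 0 ;;
        *) return 1 ;;
    esac
}

# Illustrative sample lines, not captured from a real crawl:
should_stop "Generator: 0 records selected for fetching, exiting ..." && echo "stop"
should_stop "Generator: 42 records selected" || echo "continue"
```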
