nutch-dev mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Nutch Wiki] Update of "Tutorial on incremental crawling" by Gabriele Kahlout
Date Sun, 27 Mar 2011 13:24:15 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The "Tutorial on incremental crawling" page has been changed by Gabriele Kahlout.
http://wiki.apache.org/nutch/Tutorial%20on%20incremental%20crawling?action=diff&rev1=8&rev2=9

--------------------------------------------------

  The following scripts crawl the whole web incrementally: given a list of urls to crawl,
nutch repeatedly fetches $it_size urls from that list, indexes them, and merges
them with our whole-web index, so that they can be searched immediately, until all urls have
been fetched.
  
- Tested with Nutch-1.2 release. Please report any bug you find on the mailing list and to
me [[Gabriele Kahlout|me]].
+ Tested with Nutch-1.2 release [[Incremental Crawling Scripts Test|Output]]. Please report
any bug you find on the mailing list and to [[Gabriele Kahlout|me]].
+ 
  
  If you have not done so already, follow [[Tutorial]] to set up and configure Nutch on your machine.
  
@@ -57, +58 @@

  	while [[ $i -lt $depth ]]
  	do		
  		cmd="bin/nutch generate $it_crawldb crawl/segments -topN $it_size"
- 		$cmd
  		output=`$cmd`
  		if [[ $output == *'0 records selected for fetching'* ]]
  		then
@@ -157, +157 @@

  		echo
  		cmd="bin/nutch generate $it_crawldb crawl/segments -topN $it_size"
  		echo $cmd
- 		$cmd
  		output=`$cmd`
  		echo $output
  		if [[ $output == *'0 records selected for fetching'* ]] #all the urls of this iteration have been fetched
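
Both hunks above drop a bare `$cmd` line that ran the generate step a second time just before the captured run. A minimal sketch of the resulting pattern follows: run once, capture the output, test it for the stop message. The `fake_generate` function, the `crawl/crawldb` path, and the `calls_file` counter are hypothetical stand-ins introduced only so the snippet runs without a Nutch installation. They are not part of the wiki scripts, and the stub prints only the substring the scripts look for.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for "bin/nutch generate". It appends one line to a
# temporary file per invocation so the number of runs can be checked afterwards.
calls_file=$(mktemp)
fake_generate() {
    echo run >> "$calls_file"
    # Assumed output: only the substring that the wiki scripts test for.
    echo "0 records selected for fetching"
}

# Arguments mirror the shape of the original command; the values are made up.
cmd="fake_generate crawl/crawldb crawl/segments -topN 10"

# Run the command exactly once and capture its standard output.
output=$($cmd)

if [[ $output == *'0 records selected for fetching'* ]]
then
    result="done"        # nothing left to fetch in this iteration
else
    result="continue"    # a segment was generated; carry on with fetching
fi

# Count how many times the stub was invoked, then clean up.
calls=$(wc -l < "$calls_file" | tr -d ' ')
rm -f "$calls_file"

echo "result=$result calls=$calls"
```

With a bare `$cmd` line placed before the capture, the stub would be invoked twice, which is the duplicate run that the change removes.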
