nutch-dev mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Nutch Wiki] Update of "Tutorial on incremental crawling" by Gabriele Kahlout
Date Sun, 27 Mar 2011 12:52:11 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The "Tutorial on incremental crawling" page has been changed by Gabriele Kahlout.
The comment on this change is:   .
http://wiki.apache.org/nutch/Tutorial%20on%20incremental%20crawling?action=diff&rev1=3&rev2=4

--------------------------------------------------

  Two scripts follow:
  
  1. Abridged script using Solr;
+ 
+ {{{
+ #!/bin/bash
+ 
+ #
+ # Created by Gabriele Kahlout on 27.03.11.
+ # The following script crawls the whole web incrementally: given a list of urls to crawl, nutch repeatedly fetches $it_size of them, indexes them and merges them with the whole-web index, so that they can be searched immediately, until all urls have been fetched.
+ #
+ # TO USE:
+ # 1. $ mv whole-web-crawling-incremental $NUTCH_HOME/whole-web-crawling-incremental
+ # 2. $ cd $NUTCH_HOME
+ # 3. $ chmod +x whole-web-crawling-incremental
+ # 4. $ ./whole-web-crawling-incremental
+ 
+ # Usage: ./whole-web-crawling-incremental [it_seedsDir-path urls-to-fetch-per-iteration depth]
+ # Start
+ 
+ rm -rf crawl # fresh crawl; -f so a missing directory is not an error
+ 
+ seedsDir=$1
+ it_size=$2
+ depth=$3
+ 
+ indexedPlus1=1 # 1-based line number of the next url to fetch (fed to tail -n+N); never printed out
+ it_seedsDir="$seedsDir/it_seeds"
+ rm -rf $it_seedsDir
+ mkdir $it_seedsDir
+ 
+ allUrls=`cat $seedsDir/*url* | wc -l | sed -e "s/^ *//"`
+ echo $allUrls" urls to crawl"
+ 
+ it_crawldb="crawl/crawldb"
+ 
+ 
+ while [[ $indexedPlus1 -le $allUrls ]]
+ do
+ 	rm -f $it_seedsDir/urls
+ 	tail -n+$indexedPlus1 $seedsDir/*url* | head -n$it_size > $it_seedsDir/urls
+ 	
+ 	bin/nutch inject $it_crawldb $it_seedsDir
+ 	i=0
+ 	
+ 	while [[ $i -lt $depth ]]
+ 	do		
+ 		cmd="bin/nutch generate $it_crawldb crawl/segments -topN $it_size"
+ 		output=`$cmd`
+ 		if [[ $output == *'0 records selected for fetching'* ]]
+ 		then
+ 			break;
+ 		fi
+ 		s1=`ls -d crawl/segments/2* | tail -1`
+ 
+ 		bin/nutch fetch $s1
+ 
+ 		bin/nutch updatedb $it_crawldb $s1
+ 
+ 		bin/nutch invertlinks crawl/linkdb -dir crawl/segments
+ 
+ 		bin/nutch solrindex http://localhost:8080/solr/ $it_crawldb crawl/linkdb crawl/segments/*
+ 				
+ 		((i++))
+ 	done
+ 	((indexedPlus1+=$it_size))
+ done
+ rm -r $it_seedsDir
+ 
+ }}}
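Two bits of the script's control flow can be exercised on their own, with no Nutch installation: the seed-window selection (`tail -n+$indexedPlus1 | head -n$it_size`) and the "0 records selected for fetching" termination test. A self-contained sketch (the urls and counts below are made up for illustration, and a portable `case` stands in for the script's `[[ ... == *...* ]]`):

```shell
#!/bin/sh
# 1. Seed-window selection: pick urls indexedPlus1 .. indexedPlus1+it_size-1
#    (1-based) from the seed list, exactly as the crawl loop does.
seeds=$(mktemp)
printf 'http://a/\nhttp://b/\nhttp://c/\nhttp://d/\nhttp://e/\n' > "$seeds"

indexedPlus1=3   # the first two urls are already fetched
it_size=2        # fetch two urls per iteration

window=$(tail -n+$indexedPlus1 "$seeds" | head -n$it_size)
echo "$window"
rm "$seeds"

# 2. Termination test: true when the generator reports nothing left to fetch.
out="Generator: 0 records selected for fetching, exiting ..."
case "$out" in
    *'0 records selected for fetching'*) done_marker=yes ;;
    *) done_marker=no ;;
esac
echo "$done_marker"
```

The full script would then be invoked along the lines of `./whole-web-crawling-incremental urls 100 2` (seeds directory, urls per iteration, depth; these values are illustrative).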
  
  2. Unabridged script with explanations, using nutch index.
  
  == 1. Abridged script using Solr ==
  == 2. Unabridged script with explanations and using nutch index ==
  {{{
+ 
  #!/bin/bash
  
  #
@@ -87, +156 @@

  		echo
  		cmd="bin/nutch generate $it_crawldb crawl/segments -topN $it_size"
  		echo $cmd
  		output=`$cmd`
  		echo $output
  		if [[ $output == *'0 records selected for fetching'* ]] # all the urls of this iteration have been fetched
@@ -149, +219 @@

  rm -r $crawl_dump $it_seedsDir
  echoThenRun "bin/nutch readdb $allcrawldb -dump $crawl_dump" # you can inspect the dump with $ vim $crawl_dump
  bin/nutch readdb $allcrawldb -stats
+ 
  }}}
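The echoThenRun helper used near the end of the unabridged script is defined earlier in the full page, outside the portion shown in this diff. For readers following along without the full source, a minimal guess at its shape (print the command, run it, print its output):

```shell
#!/bin/sh
# Hypothetical minimal echoThenRun, matching how the unabridged script
# invokes it: echo the command string, execute it, then echo its output.
echoThenRun() {
    echo "$1"
    output=$(eval "$1")
    echo "$output"
}

result=$(echoThenRun "echo hello")
echo "$result"
```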
  
