nutch-dev mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Nutch Wiki] Update of "Nutch 0.9 Crawl Script Tutorial" by AlessioTomasino
Date Sun, 18 May 2008 08:23:37 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The following page has been changed by AlessioTomasino:
http://wiki.apache.org/nutch/Nutch_0%2e9_Crawl_Script_Tutorial

------------------------------------------------------------------------------
  Please add comments / corrections to this document, 'cause I don't know what the heck I'm
doing yet. :)
  One thing I want to figure out is whether I can inject just a subset of URLs for pages that I
know have changed since the last crawl and refetch/index only those pages. I think there may be
a way to do this using the adddays parameter. Does anyone have any insight?
  
+ == How to refetch/index a subset of urls ==
+ 
+ My solution to this common question is to filter on the URLs we want to refetch and make
them due for fetching again with the -adddays option of the 'nutch generate' command.
+ In nutch-site.xml, enable a URL filter plugin such as urlfilter-regex and specify
the file that contains the regex filter rules:
+ 
+ <property>
+   <name>plugin.includes</name>
+   <value>protocol-http|parse-(xml|text|html|js|pdf)|index-basic|query-(basic|site|url|more)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|feed|urlfilter-regex</value>
+ </property>
+ 
+ <property>
+   <name>urlfilter.regex.file</name>
+   <value>regex-urlfilter.txt</value>
+ </property>
+ 
+ The file regex-urlfilter.txt holds one rule per line: a regular expression prefixed with
+ (accept) or - (reject). A rule can be as narrow as a specific URL we want to refetch/index, e.g.:
+ 
+ +http://myhostname/myurl.html
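
To make the rule semantics concrete, here is a minimal Python sketch of how such a rule file is evaluated. It is a simplified model, not Nutch's own code: the function names `parse_rules` and `accepts` are made up for illustration, and it assumes the usual behaviour of the regex filter, where the first matching rule decides and a URL that matches no rule is ignored.

```python
import re

def parse_rules(text):
    """Parse regex-urlfilter.txt-style lines into (accept, pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, regex = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(f"rule must start with + or -: {line!r}")
        rules.append((sign == "+", re.compile(regex)))
    return rules

def accepts(rules, url):
    """The first matching rule decides; a URL matching no rule is rejected."""
    for accept, pattern in rules:
        if pattern.search(url):  # unanchored match anywhere in the URL
            return accept
    return False

# Hypothetical rule file containing only the page from the example above.
RULES = parse_rules("""
# refetch only this page
+http://myhostname/myurl.html
""")

print(accepts(RULES, "http://myhostname/myurl.html"))  # True
print(accepts(RULES, "http://myhostname/other.html"))  # False
```

Because the pattern is unanchored and the dots are unescaped, this rule also accepts URLs that merely contain the string, such as the same page with a query string appended. Anchoring the rule with ^ and $ and escaping the dots makes it stricter.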
+ 
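+ At this stage we can use the command "$NUTCH_HOME/bin/nutch generate crawl/crawldb crawl/segments
-adddays 31" to generate a segment. The -adddays option adds the given number of days to the current
time when selecting URLs, so with a value above the default 30-day fetch interval the filtered URLs
count as due. Fetching the new segment with "$NUTCH_HOME/bin/nutch fetch crawl/segments/20080518090826"
should then produce output like: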
+ 
+ Fetcher: starting
+ 
+ Fetcher: segment: crawl/segments/20080518090826
+ 
+ Fetcher: threads: 50
+ 
+ fetching http://myhostname/myurl.html
+ 
+ redirectCount=0
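
Why 31 days works can be shown with a small Python model of the selection rule. This is only a sketch under assumptions: the helper names `next_fetch_time` and `is_due` are hypothetical, the 30-day interval is assumed to be the default fetch interval, and the real generator applies further criteria (scores, topN) that are left out here.

```python
from datetime import datetime, timedelta

def next_fetch_time(last_fetch, interval_days=30):
    """Time at which a page becomes due again (assumed 30-day interval)."""
    return last_fetch + timedelta(days=interval_days)

def is_due(next_fetch, now, add_days=0):
    """A URL is selected when its next fetch time is not after now + add_days."""
    return next_fetch <= now + timedelta(days=add_days)

now = datetime(2008, 5, 18, 9, 8, 26)

# A page fetched only yesterday is normally not due for another 29 days.
fetched_yesterday = next_fetch_time(now - timedelta(days=1))

print(is_due(fetched_yesterday, now))               # False
print(is_due(fetched_yesterday, now, add_days=31))  # True
```

Any value larger than the fetch interval has the same effect, so 31 is simply the smallest whole number of days that makes every page in a 30-day cycle due.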
+ 
+ 
+ Any comments/feedback welcome!
+ 
+ 
+ 
