nutch-dev mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Nutch Wiki] Trivial Update of "RunningNutchAndSolr" by LewisJohnMcgibbney
Date Fri, 02 Sep 2011 19:18:19 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The "RunningNutchAndSolr" page has been changed by LewisJohnMcgibbney:
http://wiki.apache.org/nutch/RunningNutchAndSolr?action=diff&rev1=68&rev2=69

  {{{
  http://nutch.apache.org/
  }}}
+ * Edit the file conf/regex-urlfilter.txt and replace 
+ {{{
+ # accept anything else
+ +.  
+ }}}
+ 
+ with a regular expression matching the domain you wish to crawl. For example, if you wished to limit the crawl to the nutch.apache.org domain, the line should read:
+ 
+ {{{
+  +^http://([a-z0-9]*\.)*nutch.apache.org/ 
+ }}} 
+ 
+ This will include any URL in the domain nutch.apache.org.
   * Run the following command:
  {{{
  bin/nutch crawl urls -dir crawl -depth 3 -topN 5
@@ -102, +115 @@

  <field name="content" type="text" stored="true" indexed="true"/>
  }}}
  
- '''This tutorial was originally constructed and posted by 'waycool' on the user lists. It has been edited slightly for integration into the Apache Nutch project.'''
- 
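
The URL filter rule added in the diff above can be checked against a few sample URLs before starting a crawl. The following is a minimal sketch in Python, not Nutch code: the helper `accepts`, the `STRICT_RULE` variant, and the sample URLs are hypothetical, and it assumes the rule is applied as an unanchored search, which is why the pattern carries its own `^`. Python's `re` module and Java's regex engine treat this particular pattern the same way.

```python
import re

# Rule body copied from the diff above, without the leading '+' that marks
# it as an "accept" rule. The dots in the host name are unescaped exactly as
# written there, so each one matches any single character.
RULE = r"^http://([a-z0-9]*\.)*nutch.apache.org/"

# Hypothetical stricter variant (not in the wiki page): dots escaped so they
# only match a literal '.'.
STRICT_RULE = r"^http://([a-z0-9]*\.)*nutch\.apache\.org/"


def accepts(url: str, pattern: str = RULE) -> bool:
    """Return True if the pattern matches the URL (hypothetical helper)."""
    return re.search(pattern, url) is not None


if __name__ == "__main__":
    samples = [
        "http://nutch.apache.org/",
        "http://wiki.nutch.apache.org/about.html",
        "http://lucene.apache.org/",
        "https://nutch.apache.org/",
        "http://nutchXapacheYorg/",
    ]
    for url in samples:
        print(url, accepts(url), accepts(url, STRICT_RULE))
```

Because the rule begins with `http://`, `https://` URLs are not matched by it. The unescaped dots also let a host such as `nutchXapacheYorg` through, which the escaped variant rejects.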
