nutch-dev mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Nutch Wiki] Trivial Update of "RunningNutchAndSolr" by LewisJohnMcgibbney
Date Fri, 02 Sep 2011 19:46:07 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The "RunningNutchAndSolr" page has been changed by LewisJohnMcgibbney:
http://wiki.apache.org/nutch/RunningNutchAndSolr?action=diff&rev1=72&rev2=73

  The parser also takes a few minutes, as it must parse the full file. Finally, we initialize the crawl database with the selected URLs.
  
  {{{ 
- bin/nutch inject crawl/crawldb dmoz 
+ bin/nutch inject crawldb dmoz 
  }}}
  
  Now we have a web database with around 1000 as-yet unfetched URLs in it.
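A quick way to confirm what the injected database now contains is the `readdb` tool, which prints summary statistics (a sketch, assuming the crawldb sits at `crawldb` as created by the inject step above):

```shell
# Print crawldb summary statistics: total URLs and counts by fetch status.
# Assumes the crawldb was created at ./crawldb by "bin/nutch inject" above.
bin/nutch readdb crawldb -stats
```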
@@ -137, +137 @@

  ===== Option 2.  Bootstrapping from an initial seed list. =====
  This option mirrors the creation of the seed list as covered [[#3. Crawl your first website|here]].
  
+ {{{ 
- {{{ bin/nutch inject crawldb urls }}}
+ bin/nutch inject crawldb urls 
+ }}}
  
  ==== Step-by-Step: Fetching ====
  To fetch, we first generate a fetch list from the database:
  
+ {{{ 
- {{{ bin/nutch generate crawldb segments }}}
+ bin/nutch generate crawldb segments 
+ }}}
  
  This generates a fetch list for all of the pages due to be fetched. The fetch list is placed
in a newly created segment directory. The segment directory is named by the time it's created.
We save the name of this segment in the shell variable {{{s1}}}:
  
@@ -152, +156 @@

  }}}
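The segment name can be captured with ordinary shell globbing; a minimal sketch, assuming segments are created under `segments/` with timestamp names, so the lexicographically last entry is the newest:

```shell
# Segment directories are named by creation timestamp, so sorting
# lexicographically and taking the last entry yields the newest segment.
s1=$(ls -d segments/2* | tail -1)
echo "$s1"
```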
  Now we run the fetcher on this segment with:
  
+ {{{ 
- {{{ bin/nutch fetch $s1 }}}
+ bin/nutch fetch $s1 
+ }}}
  
  When this is complete, we update the database with the results of the fetch:
  
+ {{{ 
- {{{ bin/nutch updatedb crawldb $s1 }}}
+ bin/nutch updatedb crawldb $s1 
+ }}}
  
  Now the database contains both updated entries for all initial pages as well as new entries
that correspond to newly discovered pages linked from the initial set.
  
  Then we parse the entries:
  
+ {{{ 
- {{{ bin/nutch parse $1 }}}
+ bin/nutch parse $s1 
+ }}}
  
  Now we generate and fetch a new segment containing the top-scoring 1000 pages:
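This second round can be sketched as the same generate/fetch/updatedb/parse cycle, with `-topN` limiting generation to the highest-scoring pages. A sketch only, assuming the same layout as above; the variable name `s2` for the new segment is illustrative:

```shell
# Generate a fetch list limited to the top-scoring 1000 pages,
# then fetch, update the crawldb, and parse the new segment.
bin/nutch generate crawldb segments -topN 1000
s2=$(ls -d segments/2* | tail -1)   # newest segment (timestamp-named)
bin/nutch fetch "$s2"
bin/nutch updatedb crawldb "$s2"
bin/nutch parse "$s2"
```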
  
@@ -191, +201 @@

  ==== Step-by-Step: Invertlinks ====
  Before indexing we first invert all of the links, so that we may index incoming anchor text
with the pages.
  
+ {{{ 
- {{{ bin/nutch invertlinks linkdb -dir segments }}}
+ bin/nutch invertlinks linkdb -dir segments 
+ }}}
  
  We are now ready to search with Apache Solr. 
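With the linkdb built, the crawl data can be indexed into a running Solr instance. A hedged sketch: the Solr URL below assumes the default `http://localhost:8983/solr`, and the exact argument form of `solrindex` varies between Nutch 1.x releases, so check `bin/nutch solrindex` usage output for your version:

```shell
# Index crawl data into Solr, attaching inverted anchor text from the linkdb.
# The Solr URL is an assumption (default standalone Solr port).
bin/nutch solrindex http://localhost:8983/solr crawldb linkdb segments/*
```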
  
