nutch-dev mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Nutch Wiki] Update of "bin/nutch generate" by SebastianNagel
Date Thu, 19 May 2016 12:41:25 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The "bin/nutch generate" page has been changed by SebastianNagel:
https://wiki.apache.org/nutch/bin/nutch%20generate?action=diff&rev1=4&rev2=5

Comment:
Add information about scope (per segment / over all segments) of -topN and generate.max.count
when multiple segments are generated

  
  '''<segments_dir>''': Path to the location of our segments directory where the Fetcher
Segments are created.
  
- '''[-force]''': This arguement will force an update even if there appears to be a lock.
/!\ : CAUTION: advised /!\
+ '''[-force]''': This argument will force an update even if there appears to be a lock. /!\
: CAUTION: advised /!\
  
  '''[-topN N]''': Where N is the number of top URLs to be selected. Normally, the "generate"
command prepares a fetchlist out of all unfetched pages, or the ones where fetch interval
already expired. But if you use -topN, then instead of all unfetched urls you only get N urls
with the highest score - potentially the most interesting ones, which should be prioritized
in fetching.
  
@@ -27, +27 @@

  
  '''[-noNorm]''': The exact same applies for normalisation parameter as does for the filtering
option above.
  
- '''[-maxNumSegments num]''': The (maximum) number of segments to be generated. Default:
1
+ '''[-maxNumSegments num]''': The (maximum) number of segments to be generated. Default:
1 -- Note: if multiple segments are generated, the limit -topN applies to the total number
of URLs for all segments taken together, while generate.max.count is applied to every generated
segment individually. 
  
  ==== Configuration Files ====
   hadoop-default.xml<<BR>>

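The scope note added to -maxNumSegments in the diff above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not Nutch's Generator code: the function name `generate`, the `(url, host, score)` tuples, and the rule that a URL exceeding the per-segment limit spills into the next segment (and is skipped if no segment has room) are hypothetical simplifications, and counting is assumed to be per host.

```python
from collections import defaultdict


def generate(urls, top_n, max_count, max_num_segments):
    """Toy model of fetch list generation (NOT Nutch's implementation).

    urls: list of (url, host, score) tuples (hypothetical input format).
    top_n: limit on the total number of URLs over all segments (-topN).
    max_count: limit per host within each segment (generate.max.count).
    max_num_segments: maximum number of segments (-maxNumSegments).
    """
    # -topN is applied once, to the total over all segments taken together.
    selected = sorted(urls, key=lambda u: u[2], reverse=True)[:top_n]

    segments = [[] for _ in range(max_num_segments)]
    per_host = [defaultdict(int) for _ in range(max_num_segments)]

    for url, host, _score in selected:
        # generate.max.count is checked for every segment individually.
        for segment, counts in zip(segments, per_host):
            if counts[host] < max_count:
                segment.append(url)
                counts[host] += 1
                break
        # Assumption: a URL that fits in no segment is skipped.

    # Only non-empty segments are reported.
    return [s for s in segments if s]


if __name__ == "__main__":
    urls = [("http://a.example/%d" % i, "a.example", float(i)) for i in range(1, 6)]
    urls.append(("http://b.example/1", "b.example", 0.5))
    for seg in generate(urls, top_n=4, max_count=2, max_num_segments=2):
        print(seg)
```

In this model, with `top_n=4` and two segments, the four selected URLs are split over both segments rather than each segment receiving four, while no segment holds more than `max_count` URLs of one host.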