nutch-dev mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Nutch Wiki] Update of "bin/crawl" by SebastianNagel
Date Wed, 15 Aug 2018 13:38:59 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The "bin/crawl" page has been changed by SebastianNagel:
https://wiki.apache.org/nutch/bin/crawl?action=diff&rev1=2&rev2=3

Comment:
Update to recent version (1.15) of bin/crawl

  = Usage =
  == Nutch 1.X ==
  {{{
-      Usage: crawl [-i|--index] [-D "key=value"] <Seed Dir> <Crawl Dir> <Num Rounds>
+ Usage: crawl [options] <crawl_dir> <num_rounds>
+ 
+ Arguments:
+   <crawl_dir>                           Directory where the crawl/host/link/segments dirs are saved
+   <num_rounds>                          The number of rounds to run this crawl for
+ 
+ Options:
-         -i|--index      Indexes crawl results into a configured indexer
+   -i|--index                            Indexes crawl results into a configured indexer
-         -D              A Java property to pass to Nutch calls
+   -D                                    A Java property to pass to Nutch calls
-         Seed Dir        Directory in which to look for a seeds file
-         Crawl Dir       Directory where the crawl/link/segments dirs are saved
-         Num Rounds      The number of rounds to run this crawl for
-      Example: bin/crawl -i -D solr.server.url=http://localhost:8983/solr/ urls/ TestCrawl/ 2
+   -w|--wait <NUMBER[SUFFIX]>            Time to wait before generating a new segment when no URLs
+                                         are scheduled for fetching. Suffix can be: s for second,
+                                         m for minute, h for hour and d for day. If no suffix is
+                                         specified second is used by default. [default: -1]
+   -s <seed_dir>                         Path to seeds file(s)
+   -sm <sitemap_dir>                     Path to sitemap URL file(s)
+   --hostdbupdate                        Boolean flag showing if we either update or not update hostdb for each round
+   --hostdbgenerate                      Boolean flag showing if we use hostdb in generate or not
+   --num-slaves <num_slaves>             Number of slave nodes [default: 1]
+                                         Note: This can only be set when running in distribution mode
+   --num-tasks <num_tasks>               Number of reducer tasks [default: 2]
+   --size-fetchlist <size_fetchlist>     Number of URLs to fetch in one iteration [default: 50000]
+   --time-limit-fetch <time_limit_fetch> Number of minutes allocated to the fetching [default: 180]
+   --num-threads <num_threads>           Number of threads for fetching / sitemap processing [default: 50]
+   --sitemaps-from-hostdb <frequency>    Whether and how often to process sitemaps based on HostDB.
+                                         Supported values are:
+                                           - never [default]
+                                           - always (processing takes place in every iteration)
+                                           - once (processing only takes place in the first iteration)
  }}}
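
The NUMBER[SUFFIX] rule described for -w|--wait (s, m, h, d, with seconds as the default when no suffix is given) can be illustrated with a small shell function. This is only a sketch of the rule as worded in the usage text, not the code used by bin/crawl: the name to_seconds is hypothetical, and the sketch accepts only non-negative integers without leading zeros, so the default of -1 would have to be handled separately.

```shell
#!/bin/sh
# Hypothetical helper (not taken from bin/crawl): convert NUMBER[SUFFIX]
# into seconds, following the suffix rule in the usage text above.
to_seconds() {
  value=$1
  case $value in
    *s) number=${value%s}; factor=1 ;;      # seconds
    *m) number=${value%m}; factor=60 ;;     # minutes
    *h) number=${value%h}; factor=3600 ;;   # hours
    *d) number=${value%d}; factor=86400 ;;  # days
    *)  number=$value; factor=1 ;;          # no suffix: seconds
  esac
  # Reject anything that is not a plain non-negative integer.
  case $number in
    ''|*[!0-9]*) echo "invalid wait value: $value" >&2; return 1 ;;
  esac
  echo $((number * factor))
}

to_seconds 90    # prints 90
to_seconds 5m    # prints 300
to_seconds 2h    # prints 7200
to_seconds 1d    # prints 86400
```

Note also that the seed directory is no longer a positional argument in the new usage. An invocation along the lines of "bin/crawl -i -s urls/ TestCrawl/ 2" (paths borrowed from the removed example, and assuming an indexer is already configured for -i) would match the updated syntax.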
  
  == Nutch 2.x ==
