nutch-dev mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Nutch Wiki] Update of "NutchTutorial" by SebastianNagel
Date Wed, 15 Aug 2018 13:41:54 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The "NutchTutorial" page has been changed by SebastianNagel:
https://wiki.apache.org/nutch/NutchTutorial?action=diff&rev1=91&rev2=92

Comment:
After release of 1.15: remove -Dsolr.server.url=... which has no effect now; fix passing <Seed Dir>

  == Using the crawl script ==
  If you have followed the section above on how the crawling can be done step by step, you might be wondering how a bash script can be written to automate all the process described above.
  
- Nutch developers have written one for you :), and it is available at [[bin/crawl]].
+ Nutch developers have written one for you :), and it is available at [[bin/crawl]]. Here the most common options and parameters:
  
  {{{
-      Usage: crawl [-i|--index] [-D "key=value"] <Seed Dir> <Crawl Dir> <Num Rounds>
+      Usage: crawl [-i|--index] [-D "key=value"] [-s <Seed Dir>] <Crawl Dir> <Num Rounds>
  	-i|--index	Indexes crawl results into a configured indexer
- 	-D		A Java property to pass to Nutch calls
+ 	-D...		A Java property to pass to Nutch calls
- 	Seed Dir	Directory in which to look for a seeds file
+ 	-s <Seed Dir>	Directory in which to look for a seeds file
- 	Crawl Dir	Directory where the crawl/link/segments dirs are saved
+ 	<Crawl Dir>	Directory where the crawl/link/segments dirs are saved
- 	Num Rounds	The number of rounds to run this crawl for
+ 	<Num Rounds>	The number of rounds to run this crawl for
-      Example: bin/crawl -i -D solr.server.url=http://localhost:8983/solr/nutch urls/ TestCrawl/  2
+      Example: bin/crawl -i -s urls/ TestCrawl/  2
  }}}
  The crawl script has lot of parameters set, and you can modify the parameters to your needs. It would be ideal to understand the parameters before setting up big crawls.
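
The change to the usage line can be illustrated with a small dry run. This is only a sketch: crawl_cmd is a hypothetical helper, not part of Nutch, and it prints the command instead of running it. It assumes the post-1.15 form from the diff, where the seed directory is passed with -s and no -D solr.server.url=... property is given.

```shell
#!/bin/sh
# Sketch only: crawl_cmd is a hypothetical helper, not part of Nutch.
# It prints a bin/crawl command line in the post-1.15 form shown in the
# diff (seed directory passed via -s, indexing enabled with -i).
crawl_cmd() {
    seed_dir=$1      # directory in which to look for a seeds file
    crawl_dir=$2     # directory where the crawl/link/segments dirs are saved
    num_rounds=$3    # number of rounds to run this crawl for
    printf 'bin/crawl -i -s %s %s %s\n' "$seed_dir" "$crawl_dir" "$num_rounds"
}

# Dry run: show the command rather than executing it.
crawl_cmd urls/ TestCrawl/ 2
```

With the values from the example in the diff, this prints the same command line as the updated wiki text, apart from the doubled space before the round count.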
  
