nutch-dev mailing list archives

From Apache Wiki <>
Subject [Nutch Wiki] Update of "NutchHadoopTutorial" by ChiaHungLin
Date Mon, 21 Feb 2011 07:16:37 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The "NutchHadoopTutorial" page has been changed by ChiaHungLin.


  scp -r /nutch/search/* nutch@computer:/nutch/search
+ '''The main point is to copy the nutch-* files (under $nutch_home/conf) and the crawl-urlfilter.txt
file to the $hadoop_home/conf folder, so that the Hadoop cluster picks up that configuration at
startup. Otherwise the Hadoop cluster will complain with messages such as "0 records selected for
fetching, exiting .. URLs to fetch - check your seed list and URL filters."'''
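  For example, with $nutch_home and $hadoop_home pointing at the Nutch and Hadoop installations
(the variable names from the note above; adjust the paths to your own layout), the copy might
look like:

    # copy the Nutch configuration files into Hadoop's conf directory
    cp $nutch_home/conf/nutch-* $hadoop_home/conf/
    cp $nutch_home/conf/crawl-urlfilter.txt $hadoop_home/conf/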
  Do this for every computer you want to use as a slave node.  Then edit the slaves file,
adding each slave node's name to the file, one per line.  You will also want to edit the hadoop-site.xml
file and change the values for the map and reduce task numbers, making them a multiple of
the number of machines you have.  For our system, which has 6 data nodes, I put in 32 as the
number of tasks.  The replication property can also be changed at this time; a good starting
value is something like 2 or 3.  *(See the note at the bottom about possibly having to clear the
filesystem of new datanodes.)  Once this is done you should be able to start up all of the nodes.
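  As a rough illustration (assuming the Hadoop 0.x property names mapred.map.tasks,
mapred.reduce.tasks and dfs.replication, which were current when this tutorial was written),
the relevant hadoop-site.xml entries for the 6-node setup above might look like:

    <!-- example values for a 6-node cluster, per the text above -->
    <property>
      <name>mapred.map.tasks</name>
      <value>32</value>
    </property>
    <property>
      <name>mapred.reduce.tasks</name>
      <value>32</value>
    </property>
    <property>
      <name>dfs.replication</name>
      <value>2</value>
    </property>

  The slaves file is simply one hostname per line (the names here are placeholders):

    slave1
    slave2
    slave3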
  To start all of the nodes we use the exact same command as before:
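  The tutorial refers back to Hadoop's start script here; presumably (an assumption, since the
command itself is cut off in this excerpt) that is the standard start-all.sh of that era:

    # run from $hadoop_home on the master node; starts the HDFS and MapReduce
    # daemons locally and on every host listed in conf/slaves
    bin/start-all.sh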
