nutch-dev mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Nutch Wiki] Update of "NutchHadoopSingleNodeTutorial" by OmkarReddy
Date Thu, 09 Nov 2017 12:56:18 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The "NutchHadoopSingleNodeTutorial" page has been changed by OmkarReddy:
https://wiki.apache.org/nutch/NutchHadoopSingleNodeTutorial?action=diff&rev1=7&rev2=8

  
  '''1. Step: Download and install Hadoop in pseudo-distributed mode, as explained here:'''
  
-  [[http://hadoop.apache.org/docs/r1.2.1/single_node_setup.html| Hadoop Single Node Setup]].
+  [[https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-common/SingleCluster.html| Hadoop Single Node Setup]].
  
  Here, it is important to set ''HADOOP_HOME'' to the root of the Hadoop installation.
  Like ''JAVA_HOME'', it has to be set globally so that the Hadoop start-up script can be called from anywhere.
  
  (Check this by running ''echo $HADOOP_HOME'' in the console, which should print the path to the root of your Hadoop installation.)
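
The check described above can be sketched as a small shell function. This is only an illustration: `check_hadoop_home` is a hypothetical helper name, and the test for `bin/hadoop` assumes the usual layout of a Hadoop installation root.

```shell
# Sketch: verify that HADOOP_HOME is set and looks like a Hadoop root.
# check_hadoop_home is a hypothetical helper, not part of Hadoop or Nutch.
check_hadoop_home() {
    if [ -z "${HADOOP_HOME:-}" ]; then
        echo "HADOOP_HOME is not set" >&2
        return 1
    fi
    # Assumption: the installation root contains the launcher at bin/hadoop.
    if [ ! -x "$HADOOP_HOME/bin/hadoop" ]; then
        echo "no executable bin/hadoop under $HADOOP_HOME" >&2
        return 1
    fi
    # Print the path, like 'echo $HADOOP_HOME' would.
    echo "$HADOOP_HOME"
}
```

Calling `check_hadoop_home` from a fresh terminal (not the one where the variable was exported) is a quick way to see whether the variable really is set globally.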
  
- '''''N.B.''''' Make sure your hadoop installation is working correctly before trying to integrate Nutch!
+ '''''N.B.''''' Make sure your hadoop installation is working correctly by running the examples as mentioned in the link above before trying to integrate Nutch!
  
  E.g. try to connect to the jobtracker at: http://localhost:50030/. 
  
@@ -22, +22 @@

  
  '''2. Step: Download and install Nutch 1.x:'''
  
- Download a stable source version e.g. apache-nutch-1.8-src.zip from http://nutch.apache.org/downloads.html.
+ Download a stable source version e.g. apache-nutch-1.13-src.zip from http://nutch.apache.org/downloads.html.
  
- For installation of apache-nutch-1.8-src.zip:
+ For installation of apache-nutch-1.13-src.zip:
  
-  * Unzip and over the terminal cd into the freshly extracted folder ''apache-nutch-1.8''
+  * Unzip and over the terminal cd into the freshly extracted folder ''apache-nutch-1.13''
  
   * Run ‘ant runtime’ in this folder
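
The two installation steps above could be scripted roughly as follows. This is a sketch under assumptions: `nutch_src_dir` and `build_nutch_runtime` are hypothetical helper names, the archive is assumed to sit in the current directory, and `unzip` and `ant` are assumed to be on the PATH.

```shell
# Hypothetical helper: derive the extracted folder name from the archive
# name by stripping the "-src.zip" suffix.
nutch_src_dir() {
    basename "$1" -src.zip
}

# Hypothetical helper: unzip the source archive and build the runtime.
build_nutch_runtime() {
    archive=$1
    unzip -q "$archive" || return 1
    # Run the build in a subshell so the caller's working directory is kept.
    ( cd "$(nutch_src_dir "$archive")" && ant runtime )
}
```

For example, `build_nutch_runtime apache-nutch-1.13-src.zip` would unpack the archive and run `ant runtime` inside ''apache-nutch-1.13''.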
  
  This command builds the runtime environment. ''runtime/local'' stores all configuration files, libraries etc.,
  but it does not use the Hadoop installation that has been set up here (pseudo-distributed mode).
  Instead it uses the local (standalone) non-distributed mode, which is often used for debugging and is described in more detail here: 
- [[http://hadoop.apache.org/docs/r1.2.1/single_node_setup.html#Local| Hadoop Standalone Setup]].
+ [[https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-common/SingleCluster.html#Standalone_Operation| Hadoop Standalone Setup]].
  
  
  However, the nutch-job jar used for hadoop in pseudo-distributed mode lives in 
  ''runtime/deploy/''. 
  As a consequence, any modification to the configuration files in ''$NUTCH/conf'' (the configuration directory at the root) requires
- a re-build with ‘ant’ to make sure the changes become part of the nutch-job jar as well.
  
+ a re-build with ‘ant’ to make sure the changes become part of the nutch-job jar as well.
+ 
+ '''''N.B.''''' Make sure that the property mapreduce.framework.name in etc/hadoop/mapred-site.xml is set as mentioned in the hadoop documentation above.
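
As a rough way to check that setting: the linked Hadoop 2.7.2 single-node guide sets mapreduce.framework.name to `yarn` for running on YARN, and the sketch below tests for that value. `has_yarn_framework` is a hypothetical helper name, and the check is a plain text match that assumes `<name>` and `<value>` sit on adjacent lines. It is not a real XML parse.

```shell
# Expected fragment in etc/hadoop/mapred-site.xml (per the Hadoop 2.7.2 guide):
#   <property>
#     <name>mapreduce.framework.name</name>
#     <value>yarn</value>
#   </property>

# Hypothetical helper: succeed if the given file sets the property to "yarn".
# Assumption: <value> is on the line right after <name> (text match only).
has_yarn_framework() {
    grep -F -A1 '<name>mapreduce.framework.name</name>' "$1" \
        | grep -F -q '<value>yarn</value>'
}
```

A call such as `has_yarn_framework "$HADOOP_HOME/etc/hadoop/mapred-site.xml" && echo ok` prints `ok` only when the fragment is present in that layout.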
  
  See: NutchTutorial on how to set up a specific configuration and run a crawl. 
  
