nutch-dev mailing list archives

From: Apache Wiki <wikidi...@apache.org>
Subject: [Nutch Wiki] Trivial Update of "NutchHadoopTutorial" by LewisJohnMcgibbney
Date: Fri, 02 Sep 2011 20:10:26 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The "NutchHadoopTutorial" page has been changed by LewisJohnMcgibbney:
http://wiki.apache.org/nutch/NutchHadoopTutorial?action=diff&rev1=30&rev2=31

+ = Nutch and Hadoop Tutorial =
- = How to Setup Nutch (V1.1) and Hadoop =
- --------------------------------------------------------------------------------
  
- Note: Originally this ([[NutchHadoopTutorial0.8]]) was written for version 0.8 of Nutch.
This has been edited by people other than the original author so statements like "I did this"
or "I recommend that" are slightly misleading. 
+ As of the official Nutch 1.3 release the source code architecture has been greatly simplified, allowing Nutch to run in one of two modes: '''local''' and '''deploy'''. Nutch no longer ships with a Hadoop distribution by default; in local mode (running Nutch as a single process on one machine) Hadoop is used only as a library dependency. This may suit you fine if you have a small site to crawl and index, but most people choose Nutch because of its capability to run in deploy mode, within a Hadoop cluster. This gives you the benefit of a distributed file system (HDFS) and the MapReduce processing style. The purpose of this tutorial is to provide a step-by-step method to get Nutch running with the Hadoop file system on multiple machines, including being able to both crawl and search across multiple machines.
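+ To illustrate the difference between the two modes, here is a minimal sketch of how each is typically invoked, assuming the standard Nutch 1.3 runtime layout produced by `ant runtime`; the crawl directory and URL seed paths below are placeholders and will differ on your installation:
+ {{{
+ # Local mode: Nutch runs in a single JVM on one machine,
+ # with Hadoop used only as a library
+ cd runtime/local
+ bin/nutch inject crawl/crawldb urls
+
+ # Deploy mode: the same command submits the Nutch job jar to an
+ # existing Hadoop cluster, so crawl/crawldb and urls refer to HDFS paths
+ cd runtime/deploy
+ bin/nutch inject crawl/crawldb urls
+ }}}
+ In both cases the command line is the same; the directory you run it from determines whether the job executes locally or is submitted to your Hadoop cluster.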
  
- --------------------------------------------------------------------------------
+ This document does not go into the Nutch or Hadoop architecture; it only tells you how to get the systems up and running. Resources on the architecture of Nutch and Hadoop can be found [[FrontPage#Nutch Development|here]], and further relevant resources are listed at the end of this tutorial.
  
+ '''N.B.''' Prerequisites for this tutorial are both the [[NutchTutorial|Nutch Tutorial]] and the [[http://hadoop.apache.org/common/docs/stable/|Hadoop Tutorial]].
- By default, out of the box, Nutch runs in a single process on one machine. This may suit
you fine if you have a small site to crawl and index, but most people choose Nutch because
of its capability to run on a Hadoop cluster. This gives you the benefit of a distributed
file system (HDFS) and MapReduce processing style.  The purpose of this tutorial is to provide
a step-by-step method to get Nutch running with Hadoop file system on multiple machines, including
being able to both index (crawl) and search across multiple machines.  
- 
- This document does not go into the Nutch or Hadoop architecture.  It only tells how to get
the systems up and running.  At the end of the tutorial though I will point you to relevant
resources if you want to know more about the architecture of Nutch and Hadoop.
- 
- The tutorial comes in two phases. Firstly we get Hadoop running on a single machine (a bit
of a simple cluster!) and then more than one machine.
  
  === Assumptions ===
  
