From Doug Cutting
Subject Re: To mapred or not
Date Thu, 01 Sep 2005 16:36:19 GMT
Kelvin Tan wrote:
> Seeing mapred is about to be folded into trunk, 3 questions:
> 1. Any benchmarks/estimates on when the scalability of map-reduce surpasses its overhead/complexity?
> e.g. with > 10 reduce workers..

I think that with as few as two boxes it will outperform uniprocessor 
Nutch.  This will not be true for very small collections, since the 
overhead of starting JVMs can dominate those.  (MapReduce runs each task 
in a separate JVM, for robustness.)

> 2. Will there be an option of a plain vanilla single-box Nutch crawler vs a map-reduce

Yes, there already is.  By default MapReduce runs on a single box in a 
single JVM.  To run on multiple boxes in multiple JVMs one must alter 
the default configuration (to name the jobtracker server) and start the 
jobtracker daemon and one or more tasktracker daemons.  There are shell 
scripts to assist with the management of daemons.
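
To illustrate the default described above, here is a minimal sketch (not the actual Nutch code) of choosing between in-process and distributed execution depending on whether a jobtracker has been named.  The class RunnerChooser is hypothetical, and the property name "mapred.job.tracker", its default value "local", and the example host are assumptions made for illustration; check your own configuration files for the real names.

```java
import java.util.Properties;

// Hypothetical illustration only; this is not code from the mapred branch.
class RunnerChooser {
    enum Mode { LOCAL_SINGLE_JVM, DISTRIBUTED }

    // Assumed property name and assumed default value.
    static final String TRACKER_KEY = "mapred.job.tracker";
    static final String LOCAL = "local";

    // With no jobtracker named, run everything in this one JVM.
    // Otherwise jobs would be submitted to the named jobtracker.
    static Mode choose(Properties conf) {
        String tracker = conf.getProperty(TRACKER_KEY, LOCAL).trim();
        return tracker.equals(LOCAL) ? Mode.LOCAL_SINGLE_JVM : Mode.DISTRIBUTED;
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        System.out.println(choose(conf)); // default configuration

        // Example host and port are made up.
        conf.setProperty(TRACKER_KEY, "master.example.com:9001");
        System.out.println(choose(conf)); // jobtracker named
    }
}
```

The point is only that the single-box case needs no extra setup: distribution is something you opt into by changing configuration and starting the daemons.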

> 3. What are the options for users who don't want to jump onboard map-red? Will pre-mapred
> be actively maintained?

The MapReduce versions of Nutch tools (inject, generate, fetch, etc.) are 
not a large amount of code.  One could easily build compatible 
non-MapReduce versions of these tools.  But then we'd be maintaining two 
versions, so we should avoid this as much as possible.  However, if some 
applications need a very different control flow, then that might warrant 
this.  For example, one might write a crawler that combines a number of 
these tools in a single process, e.g., using an RDBMS to keep track of URLs 
and links while crawling, updating the database in real time as URLs are 
fetched.  Such an architecture would not be as scalable, but might excel 
in other ways.

It would be worth considering which features of your constrained crawler 
could be cast as improvements to Nutch's existing tools (e.g., more 
seed URL formats, more output formats, HTTP 1.1, custom scopes, etc.) 
and which require a different control flow (online fetchlist building?). 
In some cases (e.g., fetch prioritization) perhaps a new Plugin should 
be added to Nutch.

In the mapred branch the webdb has been decomposed into a crawldb and a 
linkdb.  The crawldb is much smaller and simpler than the former webdb, 
containing only an entry for each known URL.  This makes updates much 
faster while crawling.  The linkdb contains only the link graph, and 
needs to be updated only prior to indexing, not with each step while 
crawling as before.
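
A toy in-memory sketch of that split may make the update pattern clearer.  It is not how the mapred branch stores anything: the class SplitDbSketch, the Status values, the pendingLinks buffer, and the choice to key the link graph by target URL are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy model only; every name here is hypothetical.
class SplitDbSketch {
    enum Status { UNFETCHED, FETCHED }

    // crawldb: one small entry per known URL.
    final Map<String, Status> crawlDb = new HashMap<>();
    // linkdb: link graph only, kept here as target URL -> source URLs (assumed layout).
    final Map<String, Set<String>> linkDb = new HashMap<>();
    // Links seen while fetching, held back until just before indexing.
    final List<String[]> pendingLinks = new ArrayList<>();

    void inject(String url) {
        crawlDb.putIfAbsent(url, Status.UNFETCHED);
    }

    // Done at each crawl step: only the crawldb is modified.
    void recordFetch(String url, List<String> outlinks) {
        crawlDb.put(url, Status.FETCHED);
        for (String target : outlinks) {
            crawlDb.putIfAbsent(target, Status.UNFETCHED);
            pendingLinks.add(new String[] { url, target });
        }
    }

    // Done once, prior to indexing: fold the buffered links into the linkdb.
    void updateLinkDb() {
        for (String[] link : pendingLinks) {
            linkDb.computeIfAbsent(link[1], k -> new TreeSet<>()).add(link[0]);
        }
        pendingLinks.clear();
    }

    public static void main(String[] args) {
        SplitDbSketch db = new SplitDbSketch();
        db.inject("http://a.example/");
        db.recordFetch("http://a.example/", Arrays.asList("http://b.example/"));
        System.out.println("known URLs: " + db.crawlDb.size());
        db.updateLinkDb();
        System.out.println("inlinks of b: " + db.linkDb.get("http://b.example/"));
    }
}
```

Each crawl step touches only the small per-URL map, while the link graph is rebuilt in one pass when it is actually needed.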


