nutch-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Doug Cutting <>
Subject Re: MapReduce in Nutch
Date Tue, 29 Mar 2005 17:39:46 GMT
Feng Zhou wrote:
> A question about the fetching MapReduce process: Is it possible that
> some segments will happen to be slower than others and thus will
> prevent the whole job from finishing? It seems that the problem will
> probably get worse with more fetch nodes, which is what we're aiming
> at.

This is generally a concern with MapReduce, that some tasks may take 
longer.  Solutions are:

1. Use more map tasks.  If you have a pool of 100 task tracker machines, 
then splitting fetching into 1000 map tasks would better guarantee that 
no single task takes too long.  It would help to prioritize these tasks. 
  The job tracker executes tasks in the order they are generated from 
the splitter. So one could rank the tasks by the number of hosts, so 
that tasks with many urls for few hosts are issued before those for few 
urls from lots of hosts, since politeness may force the former to be slow.'

2. Run multiple MapReduce jobs at once.  For example, one might run a 
job that's performing link analysis or db update at the same time as 
fetching, or even run multiple fetch jobs at once.  A single, straggler 
fetch task shouldn't keep other work from being done.

3. Limit the amount of time a fetcher task runs.  Stop fetching after a 
certain amount of time.

We should probably implement all of these.

> What about running one fetcher on each node 24/7? Each fetcher would
> take segments from a global queue. Other parts of the system do not
> have to wait untill the to-fetch queue is depleted before doing the DB
> update and new segment generation. So basically adding a queue will
> allow pipelining of the time consuming work, namely fetching, db
> update and segment generation. And we will not end up waiting for one
> or two fetchers to finish their job.

One could do that, but I think the same effect can be achieved with (2) 
above.  Consider adding a db page status named FETCHING.  When a fetch 
list is created (step 4 of the document I posted) one avoids generating 
urls whose status is FETCHING, and rewrites db entries for generated 
pages to FETCHING.  This status is overwritten when the db is updated 
with fetcher output.  If a url's status is FETCHING for more than a 
specified amount of time (e.g., one week) then it will be re-generated 

Here one bootstraps by generating multiple initial segments, and all 
urls are in the db marked FETCHING.  All segments submitted as fetcher 
jobs in parallel.  As each fetcher job completes, the database is 
updated with its output and one or more new fetcher jobs are generated.

The page db file is a queue.  Updates to the queue are batched, which 
optimizes i/o.  Each url added to the queue must be checked against the 
queue before it can generated.  Representing the queue with a B-tree 
would permit incremental updates, but would require log(N) disk seeks 
per outlink, which is much to slow.  Thus to be efficient, the queue 
must buffer new urls and periodically sort and merge them with the set 
of known urls.  This is just a page db update.

Is this convincing?


View raw message