nutch-dev mailing list archives

From Feng Zhou <fengz...@gmail.com>
Subject Re: MapReduce in Nutch
Date Tue, 29 Mar 2005 01:20:06 GMT
A question about the fetching MapReduce process: is it possible that
some segments will turn out to be slower than others and so hold up
the whole job? The problem seems likely to get worse with more fetch
nodes, which is what we're aiming for.
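
To make the concern concrete, here is a rough sketch, not a measurement.
It assumes one segment per node, a job that ends only when the slowest
segment ends, and nodes that are slow independently with probability p.
The names job_time, idle_fraction and prob_any_straggler, and the fetch
times, are made up for illustration.

```python
def job_time(segment_times):
    """The job finishes only when its slowest segment finishes."""
    return max(segment_times)


def idle_fraction(segment_times):
    """Share of total node-time spent waiting on the slowest segment."""
    total = job_time(segment_times) * len(segment_times)
    busy = sum(segment_times)
    return (total - busy) / total


def prob_any_straggler(p, nodes):
    """Chance that at least one of `nodes` fetchers is slow, assuming
    each is slow independently with probability p (an assumption)."""
    return 1.0 - (1.0 - p) ** nodes


# Hypothetical fetch times in minutes: four normal segments, one slow.
times = [60, 62, 58, 61, 180]
print(job_time(times))        # the whole job takes as long as the slow one
print(idle_fraction(times))   # more than half the node-time is idle here

# The chance of hitting at least one slow segment grows with node count.
for n in (10, 100):
    print(n, prob_any_straggler(0.05, n))
```

Under these assumptions a single slow segment sets the job time, and
the odds of having one rise quickly as nodes are added.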

What about running one fetcher on each node 24/7? Each fetcher would
take segments from a global queue. Other parts of the system would not
have to wait until the to-fetch queue is depleted before doing the DB
update and new segment generation. So basically, adding a queue would
allow pipelining of the time-consuming work, namely fetching, DB
update and segment generation, and we would not end up waiting for one
or two fetchers to finish their job.
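
A minimal sketch of the queue idea, using threads in one process as
stand-ins for fetch nodes. This is not Nutch code: run_pipeline, the
two queues and the "fetched:" marker are hypothetical, and the real
fetch and DB update steps are reduced to placeholders.

```python
import queue
import threading


def run_pipeline(segments, num_fetchers=3):
    """Fetchers pull segments from a shared queue; the updater consumes
    each fetched segment as soon as it is ready."""
    to_fetch = queue.Queue()   # global queue of segments waiting to be fetched
    fetched = queue.Queue()    # fetched segments waiting for the DB update
    updated = []               # stand-in for the updated DB

    def fetcher():
        while True:
            seg = to_fetch.get()
            if seg is None:            # sentinel: no more work
                break
            fetched.put("fetched:" + seg)   # placeholder for real fetching

    def updater():
        while True:
            item = fetched.get()
            if item is None:           # sentinel: all fetchers are done
                break
            updated.append(item)       # placeholder for the DB update

    fetch_threads = [threading.Thread(target=fetcher)
                     for _ in range(num_fetchers)]
    update_thread = threading.Thread(target=updater)
    for t in fetch_threads:
        t.start()
    update_thread.start()

    for seg in segments:
        to_fetch.put(seg)
    for _ in fetch_threads:
        to_fetch.put(None)             # one sentinel per fetcher
    for t in fetch_threads:
        t.join()
    fetched.put(None)                  # tell the updater to stop
    update_thread.join()
    return updated


print(sorted(run_pipeline(["seg1", "seg2", "seg3", "seg4"])))
```

The point of the sketch is that the updater never waits for the
to-fetch queue to drain, and a slow segment ties up only the one
fetcher working on it.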

- Feng Zhou
Grad Student, CS, UC Berkeley

On Mon, 28 Mar 2005 11:36:47 -0800, Doug Cutting <cutting@nutch.org> wrote:
> A few weeks ago I drafted the attached document, discussing how
> MapReduce might be used in Nutch.  This is an incomplete, exploratory
> document, not a final design.  Most of Nutch's file formats are altered.
> Every operation is implemented with MapReduce.  To run things on a
> single machine we can automatically start a job tracker and one or more
> task trackers, all running in the same JVM.  Hopefully this will not be much
> slower than the current implementation running on a single machine.
> 
> Comments?
> 
> Doug
