mahout-user mailing list archives

From Ted Dunning <>
Subject Re: Is there any way to reduce the cost of Mapper and Reducer setup and cleanup in an iterative MapReduce chain?
Date Thu, 05 May 2011 14:42:42 GMT

The short answer is that this is a real problem.

Try this:

*Spark: Cluster Computing with Working Sets.* Matei Zaharia, Mosharaf Chowdhury,
Michael J. Franklin, Scott Shenker, and Ion Stoica. HotCloud 2010, June 2010.
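Spark's whole point is keeping the working set in memory across iterations,
which is exactly your situation. A minimal sketch of that model, assuming
Spark's Java API (the HDFS path and the per-iteration computation are
placeholders; a real sampler would keep per-partition topic assignments and
ship only the small global counts back each sweep):

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class IterativeGibbsSketch {
  public static void main(String[] args) {
    JavaSparkContext sc = new JavaSparkContext("local[*]", "gibbs-sketch");

    // Load the corpus once and pin it in cluster memory; every later
    // iteration reuses the cached partitions instead of paying job
    // setup/teardown and an HDFS re-read. (Path is a placeholder.)
    JavaRDD<String> docs = sc.textFile("hdfs:///path/to/corpus").cache();

    for (int iter = 0; iter < 1000; iter++) {
      // Stand-in for one Gibbs sweep: a full pass over the cached data
      // that sends only a small aggregate back to the driver.
      long tokens = docs.map(line -> line.split("\\s+").length)
                        .reduce(Integer::sum);
    }
    sc.stop();
  }
}

The data is read and partitioned once; each of the 1000+ passes then runs
against cached partitions, so you never pay the 40-second job setup again.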

Alternatively, you may be interested in experimenting with MapReduce 2.0,
which allows more flexibility in the execution model.

Systems like FlumeJava (and my open-source, incomplete clone Plume) may also
help with flexibility.
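To give a sense of what that style buys you, here is a word-count-shaped
sketch following the API described in the FlumeJava paper (the class and
method names below come from the paper, not from a released library; Plume's
equivalents differ in detail). The pipeline is built lazily, and nothing
executes until run() is called, at which point the planner can fuse stages
and decide how few MapReduce jobs to actually launch:

// Sketch after the FlumeJava paper's API; names are from the paper,
// not a released library. The input path is a placeholder.
PCollection<String> lines = readTextFileCollection("/data/corpus");

// Emit (word, 1) pairs; parallelDo is deferred, so nothing runs yet.
PTable<String, Integer> ones = lines.parallelDo(
    new DoFn<String, Pair<String, Integer>>() {
      public void process(String line, EmitFn<Pair<String, Integer>> emitFn) {
        for (String word : line.split("\\s+")) {
          emitFn.emit(Pair.of(word, 1));
        }
      }
    }, tableOf(strings(), ints()));

// Group and combine; still deferred.
PTable<String, Integer> counts = ones.groupByKey().combineValues(SUM_INTS);

// Only here does the optimizer plan and execute the pipeline, typically
// fusing the steps above into a single MapReduce job.
FlumeJava.run();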

On Thu, May 5, 2011 at 2:16 AM, Stanley Xu <> wrote:

> Dear All,
> Our team is trying to implement a parallelized LDA with Gibbs sampling. We
> are using the algorithm described in plda.
> The problem is that with the Map-Reduce method the paper describes, we need
> to run a MapReduce job for every Gibbs sampling iteration, and in our tests
> it normally takes 1000-2000 iterations on our data to converge. But as we
> know, there is a cost to set up and clean up the mappers and reducers in
> every iteration; in our tests that costs about 40 seconds per iteration on
> our cluster, so 1000 iterations means almost 12 hours.
> I am wondering whether there is a way to reduce the cost of mapper/reducer
> setup and cleanup, since I would prefer to have every mapper read the same
> local data and update that local data within the mapper process. The only
> other update it needs comes from the reducers, which is pretty small data
> compared to the whole dataset.
> Is there any approach I could try (including changing part of Hadoop's
> source code)?
> Best wishes,
> Stanley Xu
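
For context, the pattern described above is a driver loop along these lines,
assuming the newer org.apache.hadoop.mapreduce Job API (the paths, job names,
and the sampling Mapper/Reducer are placeholders). Every pass submits a
brand-new job, so task startup and cleanup are paid on each of the 1000+
iterations:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class GibbsDriverSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    for (int iter = 0; iter < 1000; iter++) {
      // A fresh Job per iteration: this is where the ~40 seconds of
      // setup/cleanup overhead is paid every time.
      Job job = Job.getInstance(conf, "gibbs-iter-" + iter);
      job.setJarByClass(GibbsDriverSketch.class);
      // setMapperClass/setReducerClass for one sampling sweep go here.
      FileInputFormat.addInputPath(job, new Path("/lda/state/" + iter));
      FileOutputFormat.setOutputPath(job, new Path("/lda/state/" + (iter + 1)));
      if (!job.waitForCompletion(true)) {
        throw new IllegalStateException("Iteration " + iter + " failed");
      }
    }
  }
}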
