mahout-user mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: Is there any way to reduce the cost of Mapper and Reducer setup and cleanup in an iterative MapReduce chain?
Date Thu, 05 May 2011 14:42:42 GMT
Stanley,

The short answer is that this is a real problem.

Try this:

*Spark: Cluster Computing with Working Sets.* Matei Zaharia, Mosharaf
Chowdhury, Michael J. Franklin, Scott Shenker, Ion Stoica, in HotCloud 2010,
June 2010.

Or this: http://www.iterativemapreduce.org/

Or HaLoop: http://code.google.com/p/haloop/
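
All of these are aimed at exactly the overhead you describe: with stock Hadoop, each Gibbs sweep has to be submitted as a brand-new Job, so task JVMs are spun up, the corpus is re-read from HDFS, and the counts are re-written on every single iteration. Here is a minimal sketch of that driver loop against the 0.20-era API; the mapper/reducer classes and the lda.model.path key are made-up names, purely for illustration:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class GibbsDriver {
      public static void main(String[] args) throws Exception {
        Configuration base = new Configuration();
        Path corpus = new Path(args[0]);
        String model = args[1];                  // path of the initial word-topic counts
        for (int iter = 0; iter < 1000; iter++) {
          Configuration conf = new Configuration(base);
          conf.set("lda.model.path", model);     // hypothetical key; mappers would load the counts in setup()
          // A brand-new job per sweep: new task JVMs, corpus re-read from HDFS,
          // counts re-written -- this is the ~40 s/iteration overhead in question.
          Job job = new Job(conf, "gibbs-iteration-" + iter);
          job.setJarByClass(GibbsDriver.class);
          job.setMapperClass(GibbsSamplingMapper.class);   // hypothetical mapper: resamples topics for its documents
          job.setReducerClass(CountSummingReducer.class);  // hypothetical reducer: sums the count updates
          job.setOutputKeyClass(Text.class);
          job.setOutputValueClass(IntWritable.class);
          FileInputFormat.addInputPath(job, corpus);
          String next = "model/iteration-" + iter;
          FileOutputFormat.setOutputPath(job, new Path(next));
          if (!job.waitForCompletion(true)) {
            throw new RuntimeException("Gibbs iteration " + iter + " failed");
          }
          model = next;                          // the next sweep samples against the new counts
        }
      }
    }

The systems above attack exactly that loop, by keeping loop-invariant data cached across iterations instead of re-reading and re-shipping it every time, so only the small per-iteration updates move.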

You may also be interested in experimenting with MapReduce 2.0. That allows more
flexibility in the execution model:

http://developer.yahoo.com/blogs/hadoop/posts/2011/03/mapreduce-nextgen-scheduler/

Systems like FlumeJava (and my open-source, incomplete clone Plume) may help
with flexibility:

http://www.deepdyve.com/lp/association-for-computing-machinery/flumejava-easy-efficient-data-parallel-pipelines-xtUvap2t1I

https://github.com/tdunning/Plume/commit/a5a10feaa068b33b1d929c332e4614aba50dd39a
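
To make the pipeline idea concrete, here is a rough word-count sketch in the style of the FlumeJava paper (names approximated from the paper; Plume's actual classes differ, so treat this as pseudocode rather than a real API):

    PCollection<String> lines = readTextFileCollection("/data/docs.txt");

    // One logical stage per transformation; nothing executes yet.
    PCollection<String> words = lines.parallelDo(new DoFn<String, String>() {
      public void process(String line, EmitFn<String> emitFn) {
        for (String word : line.split("\\s+")) {
          emitFn.emit(word);
        }
      }
    }, collectionOf(strings()));

    PTable<String, Integer> wordsWithOnes = words.parallelDo(
        new DoFn<String, Pair<String, Integer>>() {
          public void process(String word, EmitFn<Pair<String, Integer>> emitFn) {
            emitFn.emit(Pair.of(word, 1));
          }
        }, tableOf(strings(), ints()));

    PTable<String, Collection<Integer>> grouped = wordsWithOnes.groupByKey();
    PTable<String, Integer> counts = grouped.combineValues(SUM_INTS);  // SUM_INTS: a predefined combiner in the paper

    // Only now does the planner see the whole graph; it fuses the chained
    // parallelDo stages and emits as few MapReduce jobs as it can.
    FlumeJava.run();

The relevant property is deferred evaluation: the planner sees the whole graph before anything runs, so a chain of small logical steps does not pay full MapReduce job startup for each one.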


On Thu, May 5, 2011 at 2:16 AM, Stanley Xu <wenhao.xu@gmail.com> wrote:

> Dear All,
>
> Our team is trying to implement a parallelized LDA with Gibbs sampling. We
> are using the algorithm described by plda, http://code.google.com/p/plda/
>
> The problem is that, with the MapReduce method the paper describes, we need
> to run a MapReduce job for every Gibbs sampling iteration, and in our tests
> it normally takes 1000 - 2000 iterations on our data to converge. As we
> know, there is a cost to re-creating and cleaning up the mappers and
> reducers in every iteration; in our tests it is about 40 seconds per
> iteration on our cluster, so 1000 iterations means almost 12 hours.
>
> I am wondering if there is a way to reduce the cost of mapper/reducer
> setup/cleanup, since I would prefer to have every mapper read the same
> local data once and keep updating it within the mapper process. The only
> other updates it needs come from the reducer, which is a small amount of
> data compared to the whole dataset.
>
> Is there any approach I could try (including changing part of Hadoop's
> source code)?
>
>
> Best wishes,
> Stanley Xu
>
