mahout-user mailing list archives

From Stanley Xu <>
Subject Is there any way to reduce the cost of Mapper and Reducer setup and cleanup in an iterative MapReduce chain?
Date Thu, 05 May 2011 09:16:49 GMT
Dear All,

Our team is trying to implement a parallelized LDA with Gibbs sampling. We
are using the algorithm described in the plda paper.

The problem is the MapReduce structure the paper describes: we need to run a
full MapReduce job for every Gibbs sampling iteration, and in our tests it
normally takes 1000-2000 iterations on our data to converge. As we know,
there is a cost to creating and cleaning up the mappers and reducers in every
iteration; on our cluster it is about 40 seconds per iteration, so 1000
iterations means almost 12 hours of overhead alone.
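
For reference, a rough sketch of the kind of driver loop I mean (this is not
our actual code; the class names LdaDriver, GibbsSamplingMapper,
TopicCountReducer and the paths are placeholders for illustration), where one
full Hadoop job is submitted and torn down per Gibbs sampling iteration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LdaDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    int numIterations = 1000;

    for (int i = 0; i < numIterations; i++) {
      Job job = new Job(conf, "gibbs-sampling-iteration-" + i);
      job.setJarByClass(LdaDriver.class);
      job.setMapperClass(GibbsSamplingMapper.class);
      job.setReducerClass(TopicCountReducer.class);
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(Text.class);

      FileInputFormat.addInputPath(job, new Path("/lda/corpus"));
      // Each iteration writes an updated model for the next iteration to read.
      FileOutputFormat.setOutputPath(job, new Path("/lda/model/iteration-" + i));

      // Every call pays the full per-job task setup and cleanup cost
      // (the ~40 seconds per iteration mentioned above).
      if (!job.waitForCompletion(true)) {
        throw new RuntimeException("Gibbs sampling iteration " + i + " failed");
      }
    }
  }
}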

I am wondering if there is a way to reduce the cost of mapper/reducer setup
and cleanup, since I would prefer to have every mapper read the same local
data once and keep updating that local data inside the mapper process. The
only other update it needs comes from the reducer, which is a pretty small
amount of data compared to the whole dataset.
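
To make that concrete, here is a minimal mapper sketch (again with made-up
names, and assuming the current model slice is shipped via the
DistributedCache, which may not match what others would do): the large
word-topic counts are loaded once in setup(), and map() only emits small
deltas, but with one job per iteration that setup() reload happens every
single iteration:

import java.io.IOException;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class GibbsSamplingMapper extends Mapper<LongWritable, Text, Text, Text> {

  // Large local state we would like to keep alive across iterations
  // instead of reloading it in every job.
  private Path localModelFile;

  @Override
  protected void setup(Context context) throws IOException {
    // Runs once per task attempt, i.e. once per iteration in the chained-job
    // design, which is exactly the repeated cost described above.
    Path[] cached = DistributedCache.getLocalCacheFiles(context.getConfiguration());
    if (cached != null && cached.length > 0) {
      localModelFile = cached[0];
      // ... load the word-topic count matrix from localModelFile into memory ...
    }
  }

  @Override
  protected void map(LongWritable offset, Text document, Context context)
      throws IOException, InterruptedException {
    // Resample topic assignments for this document against the in-memory
    // counts and emit only the small count deltas for the reducers.
    context.write(new Text("topic-count-delta"), document);
  }
}

As far as I understand, the mapred.job.reuse.jvm.num.tasks setting only reuses
task JVMs within a single job, so it would not by itself carry this in-memory
state across iterations.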

Is there any approach I could try (including changing part of Hadoop's source
code)?

Best wishes,
Stanley Xu
