spark-user mailing list archives

From Nilesh Chakraborty <>
Subject Accumulable with huge accumulated value?
Date Sat, 14 Jun 2014 13:30:31 GMT
Hey all!

I have an iterative problem. I'm trying to find something similar to
Hadoop's MultipleOutputs [1] in Spark 1.0. I need to build up a couple of
large dense vectors (they may contain billions of elements - 2 billion doubles =>
at least 16GB) by adding partial vector chunks to them. This is easily
done in Hadoop by having two MultipleOutputs in the reducer. The reducer
also writes some other outputs, and I have multiple reducers running in
parallel.

Without MultipleOutputs I'd have to break my job into 2-3 jobs and therefore
pay a performance penalty, which seems to be the only option I'm left with
in Spark. Or could I use Accumulable [2] for this purpose? I think not,
because even if I can define a custom Accumulable to do what I want, (a) I
wouldn't be able to use it as an RDD in the next job/iteration directly
(the way I can use the output partitions with Hadoop in another job), and
(b) I wouldn't be able to retrieve the dense vector incrementally, so my
vector would become driver-node-memory bound.
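For what it's worth, the merge step such a custom accumulator would need is simple to sketch. Below is a minimal, hedged illustration in plain Python (Spark itself is omitted so it stays self-contained; the function names `zero` and `add_in_place` mirror the shape of an accumulator parameter's two operations, and the sparse-chunk representation is my own assumption):

```python
# Sketch of the merge logic a custom vector accumulator would perform:
# element-wise in-place addition of partial chunks into one dense vector.
# The accumulated value lives entirely in one process's memory, which is
# exactly the driver-memory limitation described above.

def zero(size):
    """Initial accumulated value: a dense zero vector of the given size."""
    return [0.0] * size

def add_in_place(acc, chunk):
    """Merge a partial vector chunk into the accumulated dense vector.

    `chunk` is a list of (index, value) pairs, as a task might emit.
    Returns the mutated accumulator, as accumulator APIs typically expect.
    """
    for i, v in chunk:
        acc[i] += v
    return acc
```

The sketch also makes the 16GB estimate concrete: a fully materialized vector of 2 billion doubles at 8 bytes each is at least 16GB before any container overhead, and all of it would end up on the driver.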

Any ideas how I can make this work for me?


