spark-user mailing list archives

From: Aaron Davidson <>
Subject: Re: oome from blockmanager
Date: Sat, 26 Oct 2013 23:39:43 GMT
You're precisely correct about your (N*M)/(# machines) shuffle blocks per
machine. I believe you saw 5.5 million data structures instead of 9.8
million because the shuffle was only around 50% of the way through before
it blew up.
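A quick back-of-the-envelope check of those numbers (a sketch; the
7,000-partition and 5-machine figures come from Stephen's message below):

```python
# Back-of-the-envelope check of the shuffle block counts discussed above.
# N map partitions x M reduce partitions => N * M shuffle blocks in total,
# spread roughly evenly across the cluster's BlockManagers.
n_map, m_reduce, machines = 7_000, 7_000, 5

total_blocks = n_map * m_reduce                # 49,000,000 blocks cluster-wide
blocks_per_machine = total_blocks // machines  # 9,800,000 tracked per machine

# If the shuffle was only ~50% complete when the OOM hit, roughly half of
# those entries would exist yet, which is in the ballpark of the ~5.5M
# entries seen in the heap dump.
blocks_at_oom = blocks_per_machine // 2        # 4,900,000
```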

Technically, the ShuffleMapTask should not require buffering the whole
partition (it does stream it, but compression algorithms like LZF do
significant buffering, as you know). However, Spark in general takes the
stance that if a partition doesn't fit in memory, things may blow up. So
even if the ShuffleMapTask itself isn't the cause of the OOM, too-large
partitions may cause a preceding or following RDD operation to blow up
anyway.

On Sat, Oct 26, 2013 at 3:41 PM, Stephen Haberman <> wrote:

> Thanks for all the help, guys.
>
> > When you do a shuffle from N map partitions to M reduce partitions,
> > there are N * M output blocks created and each one is tracked.
>
> Okay, that makes sense. I have a few questions then...
>
> If there are N*M output blocks, does that mean that each machine will
> (generally) be responsible for (N*M)/(number of machines) blocks, and so
> the BlockManager data structures would have correspondingly less data
> with more machines?
>
> (Turns out 7,000*7,000/5 = 9.8 million, which is in the ballpark of the
> estimated 5 million or so entries that were in the BlockManager heap
> dump.)
>
> > Since you only have a few machines, you don't "need" the
> > extra partitions to add more parallelism to the reduce.
>
> True, I am perhaps overly cautious about favoring more partitions...
>
> It seems like previously, when Spark was shuffling a partition in the
> ShuffleMapTask, it buffered all the data in memory. So, even if you
> only had 5 machines, it was important to have lots of tiny slices of
> data, rather than a few big ones, to avoid OOMEs.
>
> ...but, though I had forgotten this, I believe that's no longer the
> case, and Spark/ShuffleMapTask can now fully stream partitions?
>
> If so, that seems like a big deal: one of my first patches, to have the
> default partitioner prefer the max number of partitions, no longer makes
> as much sense, and spark.default.parallelism just became a whole lot
> more useful. As in, it should always be set. (We're not setting it,
> currently.)
>
> I'll give this another go later tonight/tomorrow with fewer partitions
> and see what happens.
>
> Thanks again!
>
> - Stephen
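For what it's worth, the per-machine tracking load shrinks quadratically
as the partition count drops, since both the map side and the reduce side
contribute a factor. A hypothetical sketch (the 2,000 and 500 partition
counts are arbitrary illustrations, not recommendations from this thread):

```python
# Illustration with hypothetical partition counts: because shuffle blocks
# scale as N * M, halving both sides of the shuffle quarters the number of
# blocks each machine's BlockManager must track.
machines = 5

# Maps a partition count (used for both N and M) to tracked blocks/machine.
per_machine = {p: p * p // machines for p in (7_000, 2_000, 500)}
# 7,000 -> 9,800,000; 2,000 -> 800,000; 500 -> 50,000
```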
