spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Patrick Wendell <>
Subject Re: oome from blockmanager
Date Sat, 26 Oct 2013 19:07:03 GMT
Hey Stephen,

Just wondering, how many reducers are you using in this shuffle? By 7,000
partitions, I'm assuming you mean the map side of the shuffle. What about
the reduce side?

- Patrick

On Sat, Oct 26, 2013 at 11:43 AM, Stephen Haberman <> wrote:

> Hi,
> By dropping spark.shuffle.file.buffer.kb to 10k and using Snappy
> (thanks, Aaron), the job I'm trying to run is no longer OOMEing because
> of 300k LZF buffers taking up 4g of RAM.
> it's OOMEing because BlockManager is taking ~3.5gb of RAM
> (which is ~90% of the available heap).
> Specifically, it's two ConcurrentHashMaps:
> * BlockManager.blockInfo has ~1gb retained, AFAICT from ~5.5 million
>   entries of (ShuffleBlockId, (BlockInfo, Long))
> * BlockManager's DiskBlockManager.blockToFileSegmentMap has ~2.3gb
>   retained, AFAICT from about the same ~5.5 million entries of
>   (ShuffleBlockId, (FileSegment, Long)).
> The job stalls about 3,000 tasks through a 7,000-partition shuffle that
> is loading ~500gb from S3 on 5 m1.large (4gb heap) machines. The job
> did a few smaller ~50-partition shuffles before this larger one, but
> nothing crazy. It's an on-demand/EMR cluster, in standalone mode.
> Both of these maps are TimeStampedHashMaps, which kind of makes me
> shudder, but we have the cleaner disabled which AFAIK is what we want,
> because we aren't running long-running streaming jobs. And AFAIU if the
> hash map did get cleaned up mid-shuffle, lookups would just start
> failing (which was actually happening for this job on Spark 0.7 and is
> what prompted us to get around to trying Spark 0.8).
> So, I haven't really figured out BlockManager yet--any hints on what we
> could do here? More machines? Should there really be this many entries
> in it for a shuffle of this size?
> I know 5 machines/4gb of RAM isn't a lot, and I could use more if
> needed, but I just expected the job to go slower, not OOME.
> Also, I technically have a heap dump from a m1.xlarge (~15gb of RAM)
> cluster that also OOMEd on the same job, but I can't open the file on
> my laptop, so I can't tell if it was OOMEing for this issue or another
> one (it was not using snappy, but using 10kb file buffers, so I'm
> interested to see what happened to it.)
> - Stephen

View raw message