spark-user mailing list archives

From Mayuresh Kunjir
Subject Re: Bagel caching issues
Date Sun, 01 Dec 2013 02:58:24 GMT
I tried passing the DISK_ONLY storage level to Bagel's run method. It has run
without any errors so far, but it is too slow. I am attaching details for a
stage corresponding to the second iteration of my algorithm. (foreach at
It has been running for more than 35 minutes, and I am seeing very high GC
time for some tasks. The setup parameters are listed below.

#nodes = 16
RDD storage fraction = 0.5
degree of parallelism = 192 (16 nodes * 4 cores each * 3)
Serializer = Kryo
Vertex data size after serialization = ~12G (probably too high, but it's
the bare minimum required for the algorithm.)
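
As a rough sanity check on the parameters above, the arithmetic can be sketched in a few lines of Python. This is only a back-of-the-envelope estimate: it assumes "~12G" means about 12 GiB and that the vertex data is spread evenly across partitions and nodes, neither of which is stated in this message. The function names are made up for illustration.

```python
# Back-of-the-envelope sizing from the setup parameters listed above.
# Assumptions (not from the message): "~12G" is read as 12 GiB, and the
# data is spread evenly over partitions and nodes.

NODES = 16
CORES_PER_NODE = 4        # from "16 nodes * 4 cores each * 3"
TASKS_PER_CORE = 3
VERTEX_DATA_GIB = 12.0    # serialized vertex data, approximate


def parallelism(nodes, cores_per_node, tasks_per_core):
    """Number of partitions implied by the degree-of-parallelism formula."""
    return nodes * cores_per_node * tasks_per_core


def per_partition_mib(total_gib, partitions):
    """Average serialized size of one partition, in MiB."""
    return total_gib * 1024 / partitions


def per_node_gib(total_gib, nodes):
    """Average serialized data held by one node, in GiB."""
    return total_gib / nodes


partitions = parallelism(NODES, CORES_PER_NODE, TASKS_PER_CORE)
print(partitions)                                       # 192
print(per_partition_mib(VERTEX_DATA_GIB, partitions))   # 64.0
print(per_node_gib(VERTEX_DATA_GIB, NODES))             # 0.75
```

Under these assumptions each partition carries about 64 MiB of serialized vertex data, which is before any growth from deserialization.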

I would be grateful if you could suggest further optimizations or point out
whether, and why, Bagel is unsuitable for this data size. I need to scale my
cluster further and am not feeling at all confident looking at
Thanks and regards,

On Sat, Nov 30, 2013 at 3:07 PM, Mayuresh Kunjir

> Hi Spark users,
> I am running a pagerank-style algorithm on Bagel and bumping into "out of
> memory" issues with that.
> Referring to the following table, rdd_120 is the RDD of vertices,
> serialized and compressed in memory. On each iteration, Bagel deserializes
> the compressed RDD; e.g., rdd_126 is the uncompressed version of rdd_120,
> persisted in memory and on disk. As iterations pile up, the cached
> partitions start getting evicted. The moment an rdd_120 partition is
> evicted, it has to be recomputed and performance drops sharply.
> Although we don't need the uncompressed RDDs from previous iterations,
> they are the last ones to be evicted because of the LRU policy.
> Should I make Bagel use DISK_ONLY persistence? How much of a performance
> hit would that be? Or maybe there is a better solution here.
> Storage
>
> RDD Name  Storage Level                           Cached      Fraction  Size in    Size on
>                                                   Partitions  Cached    Memory     Disk
> rdd_83    Memory Serialized 1x Replicated         23          12%       83.7 MB    0.0 B
> rdd_95    Memory Serialized 1x Replicated         23          12%       2.5 MB     0.0 B
> rdd_120   Memory Serialized 1x Replicated         25          13%       761.1 MB   0.0 B
> rdd_126   Disk Memory Deserialized 1x Replicated  192         100%      77.9 GB    1016.5 MB
> rdd_134   Disk Memory Deserialized 1x Replicated  185         96%       60.8 GB    475.4 MB
>
> Thanks and regards,
> ~Mayuresh
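
The figures in the quoted storage table can be cross-checked with a short Python sketch. It is only an estimate under stated assumptions: all 192 partitions of an RDD are taken to be roughly equal in size, rdd_126 is taken to be the deserialized form of rdd_120 as described in the quoted message, and the on-disk portion of rdd_126 is ignored. The helper names are hypothetical.

```python
# Cross-check of the storage table in the quoted message.
# Assumptions (not from the message): equal-sized partitions, and the
# on-disk part of rdd_126 is ignored when estimating the expansion.

TOTAL_PARTITIONS = 192


def fraction_cached(cached, total=TOTAL_PARTITIONS):
    """Fraction of an RDD's partitions currently cached."""
    return cached / total


def extrapolate_full_size_mb(size_mb, cached, total=TOTAL_PARTITIONS):
    """Estimated size of the whole RDD if every partition were cached."""
    return size_mb * total / cached


# Fractions reported in the table: 12%, 13% and 96%.
print(round(fraction_cached(23) * 100))    # 12
print(round(fraction_cached(25) * 100))    # 13
print(round(fraction_cached(185) * 100))   # 96

# rdd_120: 761.1 MB for 25 of 192 partitions (serialized, compressed).
full_120_mb = extrapolate_full_size_mb(761.1, 25)
print(round(full_120_mb))                  # 5845

# rdd_126: 77.9 GB in memory with all 192 partitions cached (deserialized).
rdd_126_mb = 77.9 * 1024
expansion = rdd_126_mb / full_120_mb
print(round(expansion, 1))                 # 13.6
```

If these assumptions hold, the deserialized vertices take roughly 13 to 14 times the memory of the serialized, compressed form, so a single deserialized copy is already large next to the serialized RDD it displaces from the cache.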
