spark-user mailing list archives

From Matt Cheah <mch...@palantir.com>
Subject Out of memory building RDD on local[N]
Date Fri, 01 Nov 2013 22:51:14 GMT
Hi everyone,

I'm trying to run a task that accumulates a ~1.5GB RDD, with the Spark master URL set to local[8]. No
matter what I do with this RDD, whether I persist it with StorageLevel.DISK_ONLY or un-persist
it altogether, the JVM always runs out of heap space.
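
(For reference, these are roughly the calls I'm making, assuming the Java API; "rdd" here stands
for the accumulated RDD:)

    import org.apache.spark.storage.StorageLevel;

    // ask Spark to keep the RDD's blocks on disk rather than in memory
    rdd.persist(StorageLevel.DISK_ONLY());

    // or drop any cached blocks for the RDD entirely
    rdd.unpersist();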

I'm building the RDD in "batches", roughly as in the sketch below. I build up a Java collection of
500,000 items, then call context.parallelize() on that collection; call the result currBatchRDD.
I then take RDD.union of the previous-batch RDD (prevBatchRDD) and currBatchRDD, set prevBatchRDD
to that union result, clear the Java collection, and continue from there.
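
To make the loop concrete, here is a minimal sketch of what I'm doing, assuming the Java API; the
String elements and the fixed batch count are stand-ins for my actual input code:

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.storage.StorageLevel;

    public class BatchUnionExample {
        public static void main(String[] args) {
            JavaSparkContext sc = new JavaSparkContext("local[8]", "batch-builder");

            List<String> buffer = new ArrayList<String>(500000);
            JavaRDD<String> prevBatchRDD = null;

            // stand-in: a fixed number of batches of 500,000 dummy items each
            for (int batch = 0; batch < 10; batch++) {
                for (int i = 0; i < 500000; i++) {
                    buffer.add("item-" + batch + "-" + i);
                }

                JavaRDD<String> currBatchRDD = sc.parallelize(buffer);

                // union the new batch into the accumulated RDD
                prevBatchRDD = (prevBatchRDD == null)
                    ? currBatchRDD
                    : prevBatchRDD.union(currBatchRDD);

                // tell Spark to keep the accumulated RDD on disk rather than in memory
                prevBatchRDD.persist(StorageLevel.DISK_ONLY());

                // clear the Java collection and reuse it for the next batch, as described above
                buffer.clear();
            }
        }
    }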

I would expect that, both locally and on an actual Spark cluster, StorageLevel settings would be
respected for keeping RDDs on-heap or off-heap. However, my memory profile shows that the entire
RDD is being kept on-heap in the local case. Am I misunderstanding the documentation?

Thanks,

-Matt Cheah
