I'm trying to run a task where I accumulate a ~1.5GB RDD with the Spark master URL set to local. No matter what I do to this RDD, whether persisting it with StorageLevel.DISK_ONLY or unpersisting it altogether, it always causes the JVM to run out of memory.
I'm building the RDD in batches: I build up a Java collection of 500,000 items, then call context.parallelize() on that collection; call the result currBatchRDD. I then perform an RDD.union of the previous batch's RDD (prevBatchRDD) with currBatchRDD, set prevBatchRDD to the union result, clear the Java collection, and continue from there.
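Here's a simplified sketch of that loop (the class name, element type, and item counts are placeholders, not my real code; a fresh list is allocated per batch rather than clearing in place, so the in-flight RDD data isn't mutated):

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.storage.StorageLevel;

    public class BatchAccumulation {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf()
                    .setAppName("batch-accumulation")
                    .setMaster("local[*]"); // local mode, as in the question
            JavaSparkContext context = new JavaSparkContext(conf);

            JavaRDD<Long> prevBatchRDD = context.emptyRDD();
            List<Long> batch = new ArrayList<>(500_000);

            // Placeholder workload: 3M dummy items instead of the real data source
            for (long i = 0; i < 3_000_000L; i++) {
                batch.add(i);
                if (batch.size() == 500_000) {
                    // Parallelize the current batch, union it into the accumulated RDD,
                    // and ask Spark to keep the result on disk only
                    JavaRDD<Long> currBatchRDD = context.parallelize(batch);
                    prevBatchRDD = prevBatchRDD.union(currBatchRDD)
                            .persist(StorageLevel.DISK_ONLY());
                    batch = new ArrayList<>(500_000); // start the next batch
                }
            }

            System.out.println(prevBatchRDD.count());
            context.stop();
        }
    }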
I would expect that, both locally and on an actual Spark cluster, StorageLevel configurations would be respected for keeping RDDs on-heap or off-heap. However, my memory profile shows the entire RDD accumulating on-heap in the local case.
Am I misunderstanding the documentation?