spark-user mailing list archives

From: Matt Cheah <mch...@palantir.com>
Subject: Re: Out of memory building RDD on local[N]
Date: Sat, 02 Nov 2013 00:43:07 GMT
So I'm doing some more reading of the ParallelCollectionRDD code.

It stores the data as a plain old list. Does this also mean that the data backing a ParallelCollectionRDD
is always stored on the driver's heap, as opposed to being distributed to the cluster? Can
this behavior be overridden without explicitly saving the RDD to disk?
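As a toy illustration of what I mean (this is not Spark's actual internals; the names here are made up), a parallelized collection can be carved into partition-sized views while every element stays in a single JVM's heap, because the views all share the original list's storage:

    import java.util.ArrayList;
    import java.util.List;

    public class DriverSideSlicing {
        // Split `data` into `numSlices` contiguous views. subList() does not
        // copy, so all elements remain in this one JVM's heap.
        static <T> List<List<T>> slice(List<T> data, int numSlices) {
            List<List<T>> slices = new ArrayList<List<T>>(numSlices);
            int n = data.size();
            for (int i = 0; i < numSlices; i++) {
                int start = (int) ((long) i * n / numSlices);
                int end = (int) ((long) (i + 1) * n / numSlices);
                slices.add(data.subList(start, end));
            }
            return slices;
        }

        public static void main(String[] args) {
            List<Integer> data = new ArrayList<Integer>();
            for (int i = 0; i < 10; i++) data.add(i);
            // Prints [[0, 1, 2], [3, 4, 5], [6, 7, 8, 9]]
            System.out.println(slice(data, 3));
        }
    }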

-Matt Cheah

From: Andrew Winings <mcheah@palantir.com<mailto:mcheah@palantir.com>>
Date: Friday, November 1, 2013 3:51 PM
To: "user@spark.incubator.apache.org<mailto:user@spark.incubator.apache.org>" <user@spark.incubator.apache.org<mailto:user@spark.incubator.apache.org>>
Cc: Mingyu Kim <mkim@palantir.com<mailto:mkim@palantir.com>>
Subject: Out of memory building RDD on local[N]

Hi everyone,

I'm trying to run a task where I accumulate a ~1.5GB RDD with the Spark master URL set to local[8].
No matter what I do to this RDD, whether persisting it with StorageLevel.DISK_ONLY or un-persisting
it altogether, the JVM always runs out of heap space.

I'm building the RDD in "batches": I build up a Java collection of 500,000 items, then call
context.parallelize() on that collection; call the result currBatchRDD. Next, I perform
an RDD.union on the previous batch's RDD (prevBatchRDD) and the parallelized collection, set
prevBatchRDD to the union result, clear the Java collection, and continue from there.
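A minimal sketch of that loop (not the exact code; makeItem and the batch count are placeholders, and it uses a fresh list per batch rather than clearing a shared one, since parallelize() hands the list itself to the RDD):

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.api.java.StorageLevels;

    public class BatchedRddBuild {
        public static void main(String[] args) {
            JavaSparkContext sc = new JavaSparkContext("local[8]", "batched-rdd-build");

            final int batchSize = 500000;
            final int numBatches = 10; // placeholder; enough batches to reach ~1.5GB

            JavaRDD<String> prevBatchRDD = null;
            for (int b = 0; b < numBatches; b++) {
                // Build up the driver-side Java collection for this batch.
                List<String> batch = new ArrayList<String>(batchSize);
                for (int i = 0; i < batchSize; i++) {
                    batch.add(makeItem(b, i)); // placeholder item factory
                }

                JavaRDD<String> currBatchRDD = sc.parallelize(batch);
                currBatchRDD.persist(StorageLevels.DISK_ONLY);

                // Fold this batch into the accumulated RDD.
                prevBatchRDD = (prevBatchRDD == null)
                        ? currBatchRDD
                        : prevBatchRDD.union(currBatchRDD);
            }

            System.out.println("total elements: " + prevBatchRDD.count());
            sc.stop();
        }

        // Placeholder for whatever produces the real records.
        private static String makeItem(int batch, int i) {
            return "item-" + batch + "-" + i;
        }
    }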

I would expect that, both locally and on an actual Spark cluster, StorageLevel configurations
would be respected for keeping RDDs on-heap or off-heap. However, my memory profile shows
that the entire RDD is being collected on-heap in the local case. Am I misunderstanding the
documentation?

Thanks,

-Matt Cheah
