spark-user mailing list archives

From: Aaron Davidson
Subject: Re: storage.MemoryStore estimated size 7 times larger than real
Date: Tue, 15 Apr 2014 16:46:56 GMT

Ah, I think I can see where your issue may be coming from. In spark-shell,
the MASTER is "local[*]", which runs Spark locally with one worker thread
per logical core on your machine. This distinction only matters because the
default number of slices created by sc.parallelize() is based on the number
of cores.

So when you run from sbt, you probably use a SparkContext with a "local"
master, which sets the number of cores to 1, meaning you are doing
sc.parallelize(array, 1)

while in Spark Shell you are doing
sc.parallelize(array, 6ish?)

The only difference between the two is that the array is broken up into
more parts in the latter, so you will store blocks rdd_0_0, rdd_0_1,
..., rdd_0_5 rather than just one (large) block. In both cases, though, I
suspect that the total size is roughly the same, about 28 MB.
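
To make the slicing concrete, here is a minimal plain-Scala sketch of how a
collection might be cut into numSlices contiguous chunks. It does not call
Spark: sliceEvenly is a hypothetical helper written for illustration, and the
exact chunk boundaries Spark picks are an assumption here.

```scala
object SliceSketch {
  // Hypothetical helper (not a Spark API): split `data` into `numSlices`
  // contiguous chunks whose lengths differ by at most one element.
  def sliceEvenly[T](data: Seq[T], numSlices: Int): Seq[Seq[T]] = {
    require(numSlices >= 1, "numSlices must be positive")
    (0 until numSlices).map { i =>
      // Long arithmetic avoids overflow for large collections.
      val start = ((i.toLong * data.length) / numSlices).toInt
      val end = (((i + 1).toLong * data.length) / numSlices).toInt
      data.slice(start, end)
    }
  }

  def main(args: Array[String]): Unit = {
    val array = new Array[Int](12)

    // "local" master: one core, so one slice holding everything.
    val oneSlice = sliceEvenly(array.toSeq, 1)
    // "local[*]" on a 6-core machine (assumed): six smaller slices.
    val sixSlices = sliceEvenly(array.toSeq, 6)

    println(oneSlice.map(_.length))
    println(sixSlices.map(_.length))
    // The element count, and hence the payload, is the same either way.
    println(oneSlice.map(_.length).sum == sixSlices.map(_.length).sum)
  }
}
```

With one slice the whole array lands in a single chunk; with six it is spread
over six smaller ones, but the number of elements stored does not change.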

In my case, where I have an RDD[Array[Int]], I have 8 partitions (a number
I just chose randomly), and each one is 512 KB, so the total size is
actually 4 MB. You could do the same test with numSlices = 1, and you'd
just have a single 4 MB block.

Our two solutions produced different total memory values because of Java
primitive boxing [1]. In your case, your RDD[Int] is converted into an
Array[Any] right before being stored in memory, which effectively makes it
an Array[java.lang.Integer] [2]. In my case, the values inside the RDD are
themselves primitive arrays, so the Ints within them stay unboxed. Spark
still converts my RDD[Array[Int]] into an Array[Any], but an Array[Int] is
already an object reference, so there's no memory impact here.

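The boxing effect can be observed in plain Scala, without Spark. The sketch
below is only an illustration: BoxingSketch and firstElementClass are
hypothetical names, and toArray[Any] stands in for whatever conversion Spark
performs internally. It widens both kinds of array to Array[Any] and prints
the JVM class of the first element.

```scala
object BoxingSketch {
  // JVM class name of the first element once the array is typed Array[Any].
  def firstElementClass(values: Array[Any]): String =
    values(0).asInstanceOf[AnyRef].getClass.getName

  def main(args: Array[String]): Unit = {
    val flat: Array[Int] = Array(1, 2, 3)                         // like RDD[Int]
    val nested: Array[Array[Int]] = Array(Array(1, 2), Array(3))  // like RDD[Array[Int]]

    // Widening to Array[Any] forces every Int into its own java.lang.Integer.
    val flatAsAny: Array[Any] = flat.toArray[Any]
    // Each element here is already a reference to an int[]; nothing is boxed.
    val nestedAsAny: Array[Any] = nested.toArray[Any]

    println(firstElementClass(flatAsAny))   // java.lang.Integer
    println(firstElementClass(nestedAsAny)) // [I  (the JVM name for int[])
  }
}
```

Each java.lang.Integer is a separate heap object with its own header, plus a
reference to it in the array, which is why the boxed form takes several times
the 4 bytes per element of a primitive int[].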

On Tue, Apr 15, 2014 at 3:58 AM, wxhsdp wrote:

> Sorry, Davidson, I don't get the point. What's the essential difference
> between our code?
> /*my code*/
> val array = new Array[Int](size)
> val a = sc.parallelize(array).cache() /*4MB*/
> /*your code*/
> val numSlices = 8
> val arr = Array.fill[Array[Int]](numSlices) { new Array[Int](size / numSlices) }
> val rdd = sc.parallelize(arr, numSlices).cache()
> I'm in local mode with only one partition, so it's just an RDD of one
> partition with the type RDD[Int].
> Your RDD has 8 partitions with the type RDD[Array[Int]]; does that matter?
> My question is why the memory usage is 7x in sbt but correct in spark shell.
> As to the following question, I made a mistake, sorry.