It's likely the Ints are getting boxed at some point along the journey (perhaps starting with parallelize()). I could definitely see boxed Ints being 7 times larger than primitive ones.

If you wanted to be very careful, you could try making an RDD[Array[Int]], where each element is simply a subset of your original array, and specifying one partition per element, effectively manually partitioning your data. I suspect you'd see the 7x overhead disappear.
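As a rough sketch of the manual partitioning suggested above: the helper below splits a primitive array into a handful of primitive sub-arrays using plain Scala only. The names ChunkSketch, chunk, and numChunks are made up for illustration and are not part of Spark.

```scala
object ChunkSketch {
  // Split `data` into at most `numChunks` pieces of near-equal size.
  // Each piece stays a primitive Array[Int], so the elements are not boxed.
  // (Hypothetical helper, not a Spark API.)
  def chunk(data: Array[Int], numChunks: Int): Array[Array[Int]] = {
    require(numChunks > 0, "numChunks must be positive")
    // Round up so the last chunk absorbs any remainder; never use size 0.
    val chunkSize = math.max(1, math.ceil(data.length.toDouble / numChunks).toInt)
    data.grouped(chunkSize).toArray
  }
}
```

Assuming a SparkContext named sc as in the original test, the chunks could then be cached with something like sc.parallelize(chunks, chunks.length).cache(), which should give one Array[Int] per partition. Whether the 7x overhead actually goes away is something to check in the MemoryStore log lines; I have not verified it.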

Hi all,
To understand Spark's memory usage, I ran the following test:

val size = 1024*1024
val array = new Array[Int](size)

for (i <- 0 until size) {
  array(i) = i
}

val a = sc.parallelize(array).cache() /*4MB*/

val b = a.mapPartitions{ c => {
  val d = c.toArray

  val e = new Array[Int](2*size) /*8MB*/
  val f = new Array[Int](2*size) /*8MB*/

  for(i <- 0 until 2*size) {
    e(i) = d(i % size)
    f(i) = d((i+1) % size)


When I compile and run it with sbt, the estimated sizes of a and b are
exactly 7 times the real sizes:

14/04/15 09:10:55 INFO storage.MemoryStore: Block rdd_0_0 stored as values
to memory (estimated size 28.0 MB, free 862.9 MB)
14/04/15 09:10:55 INFO storage.BlockManagerMasterActor$BlockManagerInfo:
Added rdd_0_0 in memory on ubuntu.local:59962 (size: 28.0 MB, free: 862.9

14/04/15 09:10:56 INFO storage.MemoryStore: Block rdd_1_0 stored as values
to memory (estimated size 112.0 MB, free 750.9 MB)
14/04/15 09:10:56 INFO storage.BlockManagerMasterActor$BlockManagerInfo:
Added rdd_1_0 in memory on ubuntu.local:59962 (size: 112.0 MB, free: 750.9

But when I try it in the Spark shell, the estimated size is almost equal to
the real size:

14/04/15 09:23:27 INFO MemoryStore: Block rdd_0_0 stored as values to memory
(estimated size 4.2 MB, free 292.7 MB)
14/04/15 09:23:27 INFO BlockManagerMasterActor$BlockManagerInfo: Added
rdd_0_0 in memory on ubuntu.local:54071 (size: 4.2 MB, free: 292.7 MB)

14/04/15 09:27:40 INFO MemoryStore: Block rdd_1_0 stored as values to memory
(estimated size 17.0 MB, free 275.8 MB)
14/04/15 09:27:40 INFO BlockManagerMasterActor$BlockManagerInfo: Added
rdd_1_0 in memory on ubuntu.local:54071 (size: 17.0 MB, free: 275.8 MB)
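
As a quick check of the figures above, the ratios can be worked out in a few lines. The estimated sizes are copied from the log lines; the real size of b (16 MB) is an assumption based on the two 8 MB arrays e and f, since the 112.0 MB estimate is 4 times the 28.0 MB one. The object and value names are illustrative only.

```scala
object SizeRatios {
  // Real sizes in MB.
  val realA = 4.0   // 1024*1024 Ints at 4 bytes each
  val realB = 16.0  // ASSUMPTION: b holds the contents of e and f (8 MB each)

  // Estimated sizes in MB, as reported by MemoryStore in the logs.
  val sbtA = 28.0
  val sbtB = 112.0
  val shellA = 4.2
  val shellB = 17.0

  // How many times larger the estimate is than the real size.
  def ratio(estimated: Double, real: Double): Double = estimated / real
}
```

Under that assumption, ratio(sbtA, realA) and ratio(sbtB, realB) both come out at 7, while the shell figures stay within a few percent of the real sizes.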

Does anyone know the reason?
I'm really confused about memory use in Spark.

The JVM and Spark memory are located in different parts of system memory. The
Spark code is executed in JVM memory, and allocations like val e = new
Array[Int](2*size) /*8MB*/ use JVM memory. If not cached, generated RDDs are
written back to disk; if cached, RDDs are copied to Spark memory, is that

