spark-user mailing list archives

From wxhsdp
Subject storage.MemoryStore estimated size 7 times larger than real
Date Tue, 15 Apr 2014 02:07:24 GMT
Hi all,
To understand how Spark uses memory, I ran the following test:

val size = 1024*1024
val array = new Array[Int](size)

for (i <- 0 until size) {
  array(i) = i
}

val a = sc.parallelize(array).cache() /*4MB*/

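val b = a.mapPartitions { c =>
  val d = c.toArray

  val e = new Array[Int](2*size) /*8MB*/
  val f = new Array[Int](2*size) /*8MB*/

  for (i <- 0 until 2*size) {
    e(i) = d(i % size)
    f(i) = d((i+1) % size)
  }

  // The archived message is cut off here. Assumed ending: return both
  // arrays (16MB in total), cache the result and run an action.
  e.iterator ++ f.iterator
}.cache()

b.count()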


When I compile and run this with sbt, the estimated sizes of a and b are
exactly 7 times larger than the real sizes:

14/04/15 09:10:55 INFO storage.MemoryStore: Block rdd_0_0 stored as values
to memory (estimated size 28.0 MB, free 862.9 MB)
14/04/15 09:10:55 INFO storage.BlockManagerMasterActor$BlockManagerInfo:
Added rdd_0_0 in memory on ubuntu.local:59962 (size: 28.0 MB, free: 862.9 MB)

14/04/15 09:10:56 INFO storage.MemoryStore: Block rdd_1_0 stored as values
to memory (estimated size 112.0 MB, free 750.9 MB)
14/04/15 09:10:56 INFO storage.BlockManagerMasterActor$BlockManagerInfo:
Added rdd_1_0 in memory on ubuntu.local:59962 (size: 112.0 MB, free: 750.9 MB)

But when I run the same code in the Spark shell, the estimated sizes are
almost equal to the real sizes:

14/04/15 09:23:27 INFO MemoryStore: Block rdd_0_0 stored as values to memory
(estimated size 4.2 MB, free 292.7 MB)
14/04/15 09:23:27 INFO BlockManagerMasterActor$BlockManagerInfo: Added
rdd_0_0 in memory on ubuntu.local:54071 (size: 4.2 MB, free: 292.7 MB)

14/04/15 09:27:40 INFO MemoryStore: Block rdd_1_0 stored as values to memory
(estimated size 17.0 MB, free 275.8 MB)
14/04/15 09:27:40 INFO BlockManagerMasterActor$BlockManagerInfo: Added
rdd_1_0 in memory on ubuntu.local:54071 (size: 17.0 MB, free: 275.8 MB)
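
For reference, here is a minimal sketch of the arithmetic behind the "real
size" figures and the ratios above. It assumes 4 bytes per Int, ignores array
and object headers, and assumes that b holds both e and f (16 MB in total).
The object and method names are only illustrative.

```scala
object ExpectedSizes {
  val BytesPerInt = 4      // a JVM Int is 32 bits
  val Size = 1024 * 1024   // same `size` as in the test

  // Raw payload in MB for n Ints; headers are ignored (assumption).
  def rawMB(n: Int): Double = n.toLong * BytesPerInt / (1024.0 * 1024.0)

  val aMB: Double = rawMB(Size)      // a: one array of `size` Ints
  val bMB: Double = rawMB(4 * Size)  // b: e and f, 2*size Ints each (assumption)

  // How many times larger the logged estimate is than the raw payload.
  def ratio(estimatedMB: Double, realMB: Double): Double = estimatedMB / realMB

  def main(args: Array[String]): Unit = {
    // Estimated sizes are copied from the log lines above.
    println(f"sbt   a: ${ratio(28.0, aMB)}%.2f  b: ${ratio(112.0, bMB)}%.2f")
    println(f"shell a: ${ratio(4.2, aMB)}%.2f  b: ${ratio(17.0, bMB)}%.2f")
  }
}
```

The sbt ratios come out at exactly 7 for both RDDs, while the shell ratios
stay just above 1.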

Does anyone know the reason?
I'm really confused about how memory is used in Spark.

My understanding is that JVM memory and Spark memory are located in different
parts of system memory. Spark code is executed in JVM memory, so an allocation
like val e = new Array[Int](2*size) /*8MB*/ uses JVM memory. If an RDD is not
cached, it is written back to disk once generated; if it is cached, it is
copied into Spark memory. Is that right?