spark-user mailing list archives

From Tsai Li Ming <mailingl...@ltsai.com>
Subject RDD memory and storage level option
Date Thu, 20 Nov 2014 12:12:06 GMT
Hi,

This is on version 1.1.0.

I did a simple test of the MEMORY_AND_DISK storage level.

> import org.apache.spark.storage.StorageLevel
> val file = sc.textFile("file:///path/to/file.txt").persist(StorageLevel.MEMORY_AND_DISK)
> file.count()

The file is 1.5 GB and there is only one worker. I have requested 1 GB of worker memory per
node:

ID                       Name         Cores  Memory per Node  Submitted Time       User  State    Duration
app-20141120193912-0002  Spark shell  64     1024.0 MB        2014/11/20 19:39:12  root  RUNNING  6.0 min


After doing a simple count, the job web UI indicates the entire file is saved on disk?

RDD Name                  Storage Level                  Cached Partitions  Fraction Cached  Size in Memory  Size in Tachyon  Size on Disk
file:///path/to/file.txt  Disk Serialized 1x Replicated  46                 100%             0.0 B           0.0 B            1476.5 MB
1. Shouldn’t some partitions be saved into memory? 




2. If I run with the MEMORY_ONLY option, some partitions are saved into memory, but only
220.6 MB of the 530.3 MB shown on the executor page is used. Each partition is about 73 MB,
so why wasn't the remaining space used?

RDD Name                  Storage Level                      Cached Partitions  Fraction Cached  Size in Memory  Size in Tachyon  Size on Disk
file:///path/to/file.txt  Memory Deserialized 1x Replicated  3                  7%               220.6 MB        0.0 B            0.0 B
Executor ID  Address       RDD Blocks  Memory Used          Disk Used  Active Tasks  Failed Tasks  Complete Tasks  Total Tasks  Task Time  Input      Shuffle Read  Shuffle Write
0            foo.co:48660  3           220.6 MB / 530.3 MB  0.0 B      0             0             46              46           14.2 m     1457.4 MB  0.0 B         0.0 B

14/11/20 19:53:22 INFO BlockManagerInfo: Added rdd_1_22 in memory on foo.co:48660 (size: 73.6 MB, free: 309.6 MB)
14/11/20 19:53:22 INFO TaskSetManager: Finished task 22.0 in stage 0.0 (TID 22) in 29833 ms on foo.co (43/46)
14/11/20 19:53:24 INFO TaskSetManager: Finished task 33.0 in stage 0.0 (TID 33) in 31502 ms on foo.co (44/46)
14/11/20 19:53:24 INFO TaskSetManager: Finished task 24.0 in stage 0.0 (TID 24) in 31651 ms on foo.co (45/46)
14/11/20 19:53:24 INFO TaskSetManager: Finished task 14.0 in stage 0.0 (TID 14) in 31782 ms on foo.co (46/46)
14/11/20 19:53:24 INFO DAGScheduler: Stage 0 (count at <console>:16) finished in 31.818 s
14/11/20 19:53:24 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
14/11/20 19:53:24 INFO SparkContext: Job finished: count at <console>:16, took 31.926585742 s
res0: Long = 10000000

Is this correct?
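To show the arithmetic behind my surprise in question 2, here is a small sketch (plain Scala, no Spark needed). The figures are taken from the executor page and the rdd_1_22 log line above; the per-partition size is just the cached total divided by the three cached partitions:

```scala
object CacheMath {
  def main(args: Array[String]): Unit = {
    val storagePoolMb  = 530.3         // "Memory Used" denominator on the executor page
    val cachedMb       = 220.6         // total for the 3 cached partitions
    val perPartitionMb = cachedMb / 3  // ~73.5 MB each, matching the rdd_1_22 log line
    val freeMb         = storagePoolMb - cachedMb
    println(f"$perPartitionMb%.1f MB per partition, $freeMb%.1f MB free")
    // 309.7 MB free is room for ~4 more 73.5 MB partitions, yet none were cached.
    assert(freeMb > perPartitionMb)
  }
}
```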



3. I can’t seem to work out the math behind the 530.3 MB made available to the executor:
1024 MB * spark.storage.memoryFraction (0.6) = 614.4 MB, which doesn’t match.
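Here is the rest of the arithmetic I have tried. I am assuming (from the 1.1 configuration docs) that spark.storage.safetyFraction (default 0.9) is also applied, and that the pool is sized from the JVM's actual Runtime.maxMemory rather than the requested heap; the 982 MB figure below is a guess that would make the numbers line up, not something I have confirmed:

```scala
// Assumption: storage pool = heap * memoryFraction * safetyFraction,
// with defaults taken from the Spark 1.1 configuration docs.
object StorageMath {
  val memoryFraction = 0.6  // spark.storage.memoryFraction default
  val safetyFraction = 0.9  // spark.storage.safetyFraction default

  def storageMb(heapMb: Double): Double = heapMb * memoryFraction * safetyFraction

  def main(args: Array[String]): Unit = {
    // Naive expectation from the requested 1024 MB of worker memory:
    println(f"${storageMb(1024)}%.2f MB")  // 552.96 MB, still not 530.3 MB
    // Runtime.getRuntime.maxMemory typically reports less than -Xmx
    // (a survivor space is excluded); ~982 MB would explain the UI figure:
    println(f"${storageMb(982)}%.2f MB")   // 530.28 MB, matching the ~530.3 MB shown
  }
}
```

Can someone confirm whether that is how the 530.3 MB is derived?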

Thanks!





---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org

