spark-user mailing list archives

From: Wei Tan <>
Subject: Re: rdd.cache() is not faster?
Date: Wed, 18 Jun 2014 14:40:15 GMT
Hi Gaurav, thanks for the pointer. The observation in that link is (at
least qualitatively) similar to mine.

Now the question is: if I have big data (40 GB raw, about 60 GB cached)
and even bigger memory (192 GB) yet still cannot benefit from the RDD
cache, should I persist on disk and leverage the filesystem cache
instead?
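
For reference, a minimal sketch of the options I am comparing (the input
path is a placeholder, and Spark 1.0 APIs are assumed):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.storage.StorageLevel

    val sc = new SparkContext(new SparkConf().setAppName("CacheVsDisk"))
    val rdd = sc.textFile("hdfs:///path/to/input")  // placeholder path

    // rdd.cache() is shorthand for persist(StorageLevel.MEMORY_ONLY):
    // deserialized Java objects on the heap, which is how 40 GB of raw
    // data can grow to ~60 GB cached. A storage level can be set only
    // once per RDD, so pick one:
    rdd.persist(StorageLevel.MEMORY_ONLY_SER) // serialized: smaller, more CPU
    // or StorageLevel.DISK_ONLY, which leaves hot blocks to the OS
    // filesystem cache (the alternative I am asking about)

    rdd.count() // first action materializes the cache
    rdd.count() // subsequent actions read from the cache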

I will try more workers so that each JVM has a smaller heap.
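
In standalone mode this could be set in spark-env.sh; the numbers below
are illustrative guesses for a 192 GB node, not tuned recommendations:

    # spark-env.sh: several smaller worker JVMs per node instead of one
    # huge heap, so GC pauses on the cached data stay manageable
    export SPARK_WORKER_INSTANCES=4   # worker JVMs per node
    export SPARK_WORKER_MEMORY=40g    # heap per worker JVM
    export SPARK_WORKER_CORES=8       # cores per worker JVM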

Best regards,

Wei Tan, PhD
Research Staff Member
IBM T. J. Watson Research Center

From: Gaurav Jain <>
Date: 06/18/2014 06:30 AM
Subject: Re: rdd.cache() is not faster?

You cannot assume that caching will always reduce the execution time,
especially if the dataset is large. It appears that if too much memory is
used for caching, then less memory is left for the actual computation
itself. There has to be a balance between the two.
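
In Spark 1.x this balance is governed by spark.storage.memoryFraction,
which caps the share of the executor heap that cached blocks may occupy
(default 0.6). A sketch of lowering it to leave more room for
computation (the 0.4 below is just an illustrative value):

    import org.apache.spark.{SparkConf, SparkContext}

    // Lowering the storage fraction shrinks the cache region and leaves
    // more executor memory for shuffles and task-level objects.
    val conf = new SparkConf()
      .setAppName("CacheBalance")
      .set("spark.storage.memoryFraction", "0.4") // illustrative value
    val sc = new SparkContext(conf)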

Page 33 of this thesis from KTH talks about this:


Gaurav Jain
Master's Student, D-INFK
ETH Zurich