spark-user mailing list archives

From "tomsheep.cn@gmail.com" <tomsheep...@gmail.com>
Subject About StorageLevel
Date Thu, 26 Jun 2014 12:36:25 GMT
Hi all,

I have a newbie question about StorageLevel in Spark. I came across these sentences in the Spark
documentation:

If your RDDs fit comfortably with the default storage level (MEMORY_ONLY), leave them that
way. This is the most CPU-efficient option, allowing operations on the RDDs to run as fast
as possible.

And

Spark automatically monitors cache usage on each node and drops out old data partitions in
a least-recently-used (LRU) fashion. If you would like to manually remove an RDD instead of
waiting for it to fall out of the cache, use the RDD.unpersist() method.
http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence 
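
To make sure I read those two passages correctly, here is roughly how I understand them in code (just a sketch, assuming a SparkContext named sc is already available; the path is made up):

import org.apache.spark.storage.StorageLevel

// Persist with the default level (MEMORY_ONLY); cache() is shorthand for this.
val lengths = sc.textFile("hdfs:///some/path").map(_.length)
val cached  = lengths.persist(StorageLevel.MEMORY_ONLY)

cached.count()   // first action computes the partitions and stores them
cached.count()   // second action should read them back from the cache

// Remove it manually instead of waiting for LRU eviction.
cached.unpersist()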

But I found that the default storageLevel is NONE in the source code, and if I never call 'persist(someLevel)',
that value stays NONE. The 'iterator' method goes to:
final def iterator(split: Partition, context: TaskContext): Iterator[T] = {
    if (storageLevel != StorageLevel.NONE) {
        // a level was set via persist()/cache(): ask the CacheManager, which
        // returns the cached partition or computes and stores it
        SparkEnv.get.cacheManager.getOrCompute(this, split, context, storageLevel)
    } else {
        // no storage level set: just compute the partition (or read the checkpoint)
        computeOrReadCheckpoint(split, context)
    }
}
Does that mean RDDs are cached in memory (or elsewhere) if and only if 'persist'
or 'cache' is called explicitly,
and otherwise they are re-computed every time they are used, even in an iterative job?
This confused me, because my first impression was that Spark is super-fast because it
automatically keeps intermediate results in memory.
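
To make the question concrete, here is a toy version of the iterative pattern I have in mind (just a sketch, assuming a SparkContext named sc; the data and iteration count are made up):

val data = sc.parallelize(1 to 1000).map(_ * 2)   // no persist()/cache() here

for (i <- 1 to 10) {
  // Each reduce is an action and triggers a job. My question is whether the
  // parallelize + map lineage above is re-computed for every one of these
  // jobs, since storageLevel is NONE.
  println(data.reduce(_ + _))
}

// versus the explicitly cached variant:
val cachedData = data.cache()   // cache() is persist(StorageLevel.MEMORY_ONLY)
for (i <- 1 to 10) {
  println(cachedData.reduce(_ + _))   // after the first job, partitions should come from the cache
}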

Forgive me if this is a stupid question.

Regards,
Kang Liu