spark-dev mailing list archives

From innowireless TaeYun Kim <>
Subject Suggestion: rdd.compute()
Date Wed, 11 Jun 2014 03:10:38 GMT

Regarding the following scenario, would it be nice to have an action method
named something like 'compute()' that does nothing but compute/materialize
all the partitions of an RDD?
It could also be useful for profiling.

-----Original Message-----
From: innowireless TaeYun Kim [] 
Sent: Wednesday, June 11, 2014 11:40 AM
Subject: Question about RDD cache, unpersist, materialization


What I (seem to) know about the RDD persisting API is as follows:
- cache() and persist() are not actions. They only set a mark.
- unpersist() is also not an action. It only removes the mark. But if the
rdd is already in memory, it is unloaded.

And there seems to be no API to forcefully materialize the RDD without
requesting data through an action method, for example first().
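
To make the marking-versus-action distinction above concrete, here is a small toy model in plain Java. It is only a sketch of the behaviour as I understand it, not Spark code: ToyRdd, timesComputed() and the demo class are hypothetical names invented for illustration, and count() here merely plays the role of "some action".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Supplier;

/** Toy stand-in for an RDD (hypothetical, NOT Spark's implementation). */
class ToyRdd<T> {
    private final Supplier<List<T>> lineage;  // how to (re)build the data
    private boolean markedForCache = false;   // set by cache(), cleared by unpersist()
    private List<T> cached = null;            // held only after an action has run
    private int computations = 0;             // how often the lineage was evaluated

    ToyRdd(Supplier<List<T>> lineage) { this.lineage = lineage; }

    /** Not an action: only sets the mark, computes nothing. */
    ToyRdd<T> cache() { markedForCache = true; return this; }

    /** Not an action: removes the mark and drops data already held. */
    ToyRdd<T> unpersist() { markedForCache = false; cached = null; return this; }

    /** Action: forces computation; the result is kept only if marked. */
    long count() { return materialize().size(); }

    int timesComputed() { return computations; }

    private List<T> materialize() {
        if (cached != null) return cached;    // already materialized
        computations++;
        List<T> data = new ArrayList<>(lineage.get());
        if (markedForCache) cached = data;
        return data;
    }
}

class ToyRddDemo {
    public static void main(String[] args) {
        ToyRdd<String> rdd = new ToyRdd<>(() -> Arrays.asList("a", "b", "c"));
        rdd.cache();
        System.out.println(rdd.timesComputed());  // 0: cache() computed nothing
        rdd.count();
        rdd.count();
        System.out.println(rdd.timesComputed());  // 1: second count() hit the cache
        rdd.unpersist();
        rdd.count();
        System.out.println(rdd.timesComputed());  // 2: rebuilt after unpersist()
    }
}
```

In this model the only way to fill the cache is to run an action that asks for some result, which is the gap described above.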

So, I am faced with the following scenario.

    JavaRDD<T> rddUnion = sc.parallelize(new ArrayList<T>());  // create empty for merging
    for (int i = 0; i < 10; i++) {
        JavaRDD<T2> rdd = sc.textFile(inputFileNames[i]);
        rdd.cache();  // Since it will be used twice, cache.
        ...[i]);  // Transform and save, rdd materializes
        rddUnion = rddUnion.union(...);  // Do another transform to T and merge by union
        rdd.unpersist();  // Now it seems not needed. (But needed actually)
    }
    // Here, rddUnion actually materializes, and needs all 10 rdds that were
    // already unpersisted.
    // So, all 10 rdds will be rebuilt.

If rddUnion can be materialized before the rdd.unpersist() line and
cache()d, the rdds in the loop will not be needed on

Now what is the best strategy?
- Do not unpersist any of the 10 rdds in the loop.
- Materialize rddUnion in the loop by calling a 'light' action API, like
- Give up and just rebuild/reload all 10 rdds when saving rddUnion.
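
The cost difference between the first and the last option can be sketched with a toy model in plain Java. This is not Spark code and makes no claim about Spark's internals: ToyData, UnpersistOrderDemo and run() are hypothetical names, collect() stands in for any action, and each "input" is a single made-up row. It only counts how often the inputs get loaded depending on where unpersist() is called.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

/** Toy lazy dataset; a stand-in for an RDD (hypothetical, NOT Spark code). */
class ToyData {
    private final Supplier<List<String>> lineage;
    private boolean marked;
    private List<String> cached;

    ToyData(Supplier<List<String>> lineage) { this.lineage = lineage; }

    ToyData cache() { marked = true; return this; }

    void unpersist() { marked = false; cached = null; }

    /** Lazy: nothing is computed until collect() runs on the result. */
    ToyData union(ToyData other) {
        return new ToyData(() -> {
            List<String> out = new ArrayList<>(this.collect());
            out.addAll(other.collect());
            return out;
        });
    }

    /** Action: computes from the lineage unless a cached copy exists. */
    List<String> collect() {
        if (cached != null) return cached;
        List<String> data = lineage.get();
        if (marked) cached = data;
        return data;
    }
}

class UnpersistOrderDemo {
    /** Returns how many times the inputs were loaded in total. */
    static int run(int inputs, boolean unpersistInsideLoop) {
        int[] loads = {0};
        List<ToyData> parents = new ArrayList<>();
        ToyData union = new ToyData(ArrayList::new);  // empty, for merging
        for (int i = 0; i < inputs; i++) {
            final int id = i;
            ToyData rdd = new ToyData(() -> {
                loads[0]++;                   // counts every (re)load of this input
                List<String> rows = new ArrayList<>();
                rows.add("row-" + id);
                return rows;
            }).cache();
            rdd.collect();                    // first use: materializes and caches
            union = union.union(rdd);         // second use: lazy, nothing runs yet
            if (unpersistInsideLoop) rdd.unpersist();
            else parents.add(rdd);
        }
        union.cache().collect();              // the union materializes here
        for (ToyData p : parents) p.unpersist();  // safe now: the union is cached
        return loads[0];
    }

    public static void main(String[] args) {
        System.out.println(run(10, true));   // 20: every input is loaded twice
        System.out.println(run(10, false));  // 10: every input is loaded once
    }
}
```

In this model, postponing unpersist() until the union has been materialized and cached halves the loads, at the price of keeping all 10 inputs in memory until then.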

Is there some misunderstanding?

