spark-issues mailing list archives

From "Zhang, Liye (JIRA)" <>
Subject [jira] [Commented] (SPARK-4094) checkpoint should still be available after rdd actions
Date Wed, 29 Oct 2014 04:21:33 GMT


Zhang, Liye commented on SPARK-4094:

SPARK-3625 addressed a similar problem, but it currently does not support a case like this:
*rdd0 = sc.makeRDD(...)*
*rdd1 = rdd0.flatMap(...)*
in which *rdd0* would not be checkpointed.
In this JIRA, we will always traverse the whole RDD lineage on every RDD action, stopping at
any RDD that has already been checkpointed. Since the traversal only checks the status of
each RDD, it should not have much impact on performance.
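
To make the traversal described above concrete, here is a minimal sketch using a toy model of the lineage. It is not Spark's actual implementation: *Node*, *LineageWalk* and *doCheckpoint* are hypothetical names chosen for illustration, and "checkpointing" here only flips a flag instead of writing any data.

```scala
// Toy stand-in for an RDD in a lineage graph (hypothetical, not a Spark class).
final class Node(val name: String, val parents: Seq[Node] = Nil) {
  var checkpointRequested = false // set when the user asks for a checkpoint
  var checkpointed = false        // set once the checkpoint has been "written"
  def checkpoint(): Unit = checkpointRequested = true
}

object LineageWalk {
  // Meant to run after every action. Walks up the lineage and stops at any
  // node that is already checkpointed, since its ancestors need no further work.
  // Returns the names of the nodes checkpointed during this pass.
  def doCheckpoint(node: Node): List[String] =
    if (node.checkpointed) Nil
    else {
      // Only status flags are inspected on the way up; no data is touched.
      val fromParents = node.parents.toList.flatMap(doCheckpoint)
      if (node.checkpointRequested) {
        node.checkpointed = true
        fromParents :+ node.name
      } else fromParents
    }
}
```

With *rdd0 = new Node("rdd0")* and *rdd1 = new Node("rdd1", Seq(rdd0))*, a first call to *LineageWalk.doCheckpoint(rdd1)* finds nothing to do. After *rdd0.checkpoint()*, the next call still reaches *rdd0* through *rdd1* and marks it, which is the behavior proposed here: an earlier action does not prevent a later checkpoint request from taking effect.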

> checkpoint should still be available after rdd actions
> ------------------------------------------------------
>                 Key: SPARK-4094
>                 URL:
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 1.1.0
>            Reporter: Zhang, Liye
> rdd.checkpoint() must be called before any action on the RDD. If any action has already
> run on it, the checkpoint will never succeed. Take the following code as an example:
> *rdd = sc.makeRDD(...)*
> *rdd.collect()*
> *rdd.checkpoint()*
> *rdd.count()*
> This RDD would never be checkpointed. This does not happen with RDD cache: caching always
> succeeds, no matter whether any actions have run before it.
> So rdd.checkpoint() should behave the same way as rdd.cache().
