spark-user mailing list archives

From Sean Owen <so...@cloudera.com>
Subject Re: Custom persist or cache of RDD?
Date Mon, 10 Nov 2014 22:15:34 GMT
Well, you can always create C by loading B back from disk, and likewise E from
D. No need for any custom procedure.
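For example, a minimal sketch with the Spark 1.x SQL API (the path and the
B -> C step are placeholders, not from this thread):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    object ResumeFromParquet {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("resume-from-parquet"))
        val sqlContext = new SQLContext(sc)

        // Reload B from the file written earlier with B.saveAsParquetFile(...).
        // The reloaded SchemaRDD carries no lineage back to A, so a failure
        // further down the flow recomputes at most from this point.
        val b = sqlContext.parquetFile("hdfs:///tmp/pipeline/B.parquet")

        // Rebuild C (and the rest of the flow) from the reloaded B.
        val c = b.map(row => row)  // stand-in for the real B -> C transformation
        // ... D -> E -> F as before, reloading D the same way if needed

        sc.stop()
      }
    }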

On Mon, Nov 10, 2014 at 7:33 PM, Benyi Wang <bewang.tech@gmail.com> wrote:
> When I have a multi-step process flow like this:
>
> A -> B -> C -> D -> E -> F
>
> I need to store B's and D's results in parquet files:
>
> B.saveAsParquetFile
> D.saveAsParquetFile
>
> If I don't cache/persist any step, Spark might recompute A, B, C, D, and E
> if something goes wrong in F.
>
> Of course, I could cache all steps to avoid this recomputation if I have
> enough memory, or persist the results to disk. But persisting B and D seems
> redundant with saving B and D as parquet files.
>
> I'm wondering if Spark can restore B and D from the parquet files using a
> customized persist and restore procedure?


