spark-user mailing list archives

From Benyi Wang <bewang.t...@gmail.com>
Subject Custom persist or cache of RDD?
Date Mon, 10 Nov 2014 19:33:20 GMT
When I have a multi-step process flow like this:

A -> B -> C -> D -> E -> F

I need to store the results of B and D in Parquet files:

B.saveAsParquetFile
D.saveAsParquetFile

If I don't cache/persist any step, Spark might recompute A, B, C, D and E
if something goes wrong in F.

Of course, if I have enough memory I could cache every step to avoid this
recomputation, or persist the results to disk. But persisting B and D seems
to duplicate the work of saving B and D as Parquet files.

I'm wondering whether Spark can restore B and D from the Parquet files using
a customized persist and restore procedure?
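
To make the idea concrete, here is a rough sketch of what I have in mind,
not a built-in Spark feature. It assumes the Spark 1.x SQL API, where B and D
are SchemaRDDs and a SQLContext is available. The object name
ParquetCheckpoint, the helper saveAndReload, and the paths and compute
functions in the usage comment are all made up for illustration.

```scala
import org.apache.spark.sql.{SQLContext, SchemaRDD}

// Hypothetical helper: write a step's result to Parquet, then continue the
// pipeline from the files instead of from the in-memory lineage.
object ParquetCheckpoint {
  def saveAndReload(sqlContext: SQLContext,
                    step: SchemaRDD,
                    path: String): SchemaRDD = {
    // Materialize the step once as Parquet (this runs a job).
    step.saveAsParquetFile(path)
    // The returned SchemaRDD's lineage starts at the Parquet files, so a
    // failure downstream re-reads the files rather than recomputing
    // the upstream steps.
    sqlContext.parquetFile(path)
  }
}

// Usage sketch (computeB, computeC, computeD and the paths are placeholders):
//   val b = ParquetCheckpoint.saveAndReload(sqlContext, computeB(a), "/data/B")
//   val d = ParquetCheckpoint.saveAndReload(sqlContext, computeD(computeC(b)), "/data/D")
```

With something like this, the Parquet output would double as the persisted
copy, at the cost of reading B and D back from storage in the later steps.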
