In case of a task failure, does Spark clear the persisted RDD (StorageLevel.MEMORY_ONLY_SER) and recompute it when the task is reattempted from the beginning, or is data appended to the already-cached RDD?
How does Spark check whether an RDD has already been cached, so it can skip the caching step for a particular task?
I am not completely sure, but:
- if the RDD is persisted in memory and the task failure brings down the executor JVM process, the cached blocks die with it, so the memory is released
- if the RDD is persisted on disk, then on failure Spark's shutdown hook simply wipes the temp files
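As for the second question: conceptually, Spark keys each cached partition by a block id (RDD id + partition id) and does a get-or-compute lookup before running a partition's computation. Below is a simplified, hypothetical Python model of that behavior (the class and method names are illustrative, not the actual Spark API, which does this via `RDD.getOrCompute` and the `BlockManager` internally). It also shows why nothing is "appended": a block is stored whole, and a lost block is simply recomputed and stored again.

```python
# Simplified model of Spark's cached-block lookup. Names here are
# illustrative, not the real Spark API.

class BlockCache:
    """Caches partition data keyed by (rdd_id, partition_id)."""

    def __init__(self):
        self._blocks = {}

    def get_or_compute(self, rdd_id, partition_id, compute):
        key = (rdd_id, partition_id)
        if key in self._blocks:        # cache hit: skip recomputation entirely
            return self._blocks[key]
        data = compute()               # cache miss: recompute from lineage
        self._blocks[key] = data       # stored whole -- replaced, never appended
        return data

    def drop(self, rdd_id, partition_id):
        """A block lost with a failed executor just disappears from the cache."""
        self._blocks.pop((rdd_id, partition_id), None)


cache = BlockCache()
computations = []

def compute_partition():
    computations.append(1)             # count how often we actually compute
    return [x * 2 for x in range(5)]

first = cache.get_or_compute(42, 0, compute_partition)   # computes
second = cache.get_or_compute(42, 0, compute_partition)  # cache hit, no recompute
assert first == second and len(computations) == 1

cache.drop(42, 0)                                        # simulate a lost block
retried = cache.get_or_compute(42, 0, compute_partition) # recomputed, not appended
assert retried == first and len(computations) == 2
```

So on retry, any partitions whose cached blocks survived are served from the cache, and only the lost ones are recomputed; the result of a recomputation overwrites the block slot rather than appending to it.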