spark-user mailing list archives

From jeff saremi <jeffsar...@hotmail.com>
Subject Re: How can i remove the need for calling cache
Date Thu, 03 Aug 2017 03:47:16 GMT
Thanks Vadim. Yes, this is a good option for us. Thanks.

________________________________
From: Vadim Semenov <vadim.semenov@datadoghq.com>
Sent: Wednesday, August 2, 2017 6:24:40 PM
To: Suzen, Mehmet
Cc: jeff saremi; user@spark.apache.org
Subject: Re: How can i remove the need for calling cache

So if you just save an RDD to HDFS via 'saveAsSequenceFile', you would have to create a new
RDD that reads that data back; this way you avoid recomputing the RDD, but you may lose time on saving/loading.
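A minimal sketch of that save-then-reload pattern (assuming an existing `SparkContext` named `sc` and an `RDD[(String, Int)]` named `rdd`; the HDFS path is illustrative):

```scala
import org.apache.hadoop.io.{IntWritable, Text}

// Write the RDD out; this is an action, so the lineage is computed once here.
rdd.saveAsSequenceFile("hdfs:///tmp/intermediate")

// Build a fresh RDD from the saved files. Downstream stages now read from
// HDFS instead of re-running the original lineage.
val reloaded = sc
  .sequenceFile("hdfs:///tmp/intermediate", classOf[Text], classOf[IntWritable])
  .map { case (k, v) => (k.toString, v.get) } // convert Writables back to Scala types
```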

Exactly the same thing happens with 'checkpoint'; 'checkpoint' is basically just a convenience
method that gives you the same RDD back.

However, if your job fails, there's no way for a new job to reuse the already-'checkpointed' data
from the previous failed run. That's where having a custom checkpointer helps.

Another note: you cannot delete 'checkpoint'ed data within the same job; you need to delete it
some other way.
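One way to do that cleanup from a separate step is the Hadoop FileSystem API; a sketch (the checkpoint directory path is illustrative, and `sc` is assumed to be an existing `SparkContext`):

```scala
import org.apache.hadoop.fs.{FileSystem, Path}

// Delete the checkpoint directory recursively after the job that wrote it
// has finished (or from a separate cleanup job).
val fs = FileSystem.get(sc.hadoopConfiguration)
fs.delete(new Path("hdfs:///tmp/checkpoints"), true) // true = recursive
```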

BTW, have you tried '.persist(StorageLevel.DISK_ONLY)'? It caches data to local disk, freeing
up space in the JVM and letting you avoid HDFS.
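A sketch of that storage level in use (again assuming an existing `rdd`):

```scala
import org.apache.spark.storage.StorageLevel

// Spill cached partitions to each executor's local disk instead of
// keeping them on the JVM heap or writing them to HDFS.
val cached = rdd.persist(StorageLevel.DISK_ONLY)
cached.count() // the first action materializes the on-disk copy
```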

On Wednesday, August 2, 2017, Vadim Semenov <vadim.semenov@datadoghq.com> wrote:
`saveAsObjectFile` doesn't save the DAG, it acts as a typical action, so it just saves data
to some destination.

`cache`/`persist` let you cache data and keep the DAG, so in case an executor that holds
data goes down, Spark is still able to recompute the missing partitions.

`localCheckpoint` lets you sacrifice fault tolerance and truncate the DAG, so if an
executor goes down, the job will fail, because the DAG has already been forgotten. https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L1551-L1610
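A sketch of that trade-off (assuming an existing `rdd`):

```scala
// localCheckpoint truncates the lineage but stores the data only on the
// executors' local storage; losing an executor loses those partitions
// permanently, since the DAG needed to recompute them is gone.
val lc = rdd.localCheckpoint()
lc.count() // the action triggers the local checkpoint
```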

and `checkpoint` lets you save data to some shared storage and truncate the DAG, so if
an executor goes down, the job is able to fetch the missing partitions from the place where
it saved the RDD:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L1533-L1549
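A sketch of the reliable-checkpoint flow (assuming `sc` is an existing `SparkContext`; the directory path is illustrative):

```scala
// A checkpoint directory on shared storage must be configured first.
sc.setCheckpointDir("hdfs:///tmp/checkpoints")

// checkpoint() only marks the RDD; the data is actually written to the
// checkpoint directory when the next action runs, and the lineage is
// truncated afterwards.
rdd.checkpoint()
rdd.count()
```

Note that without a prior `cache`/`persist`, the RDD is computed once for the action and once more when it is written out, which is what the quoted recommendation below is about.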

On Wed, Aug 2, 2017 at 7:20 PM, Suzen, Mehmet <suzen@acm.org> wrote:
On 3 August 2017 at 01:05, jeff saremi <jeffsaremi@hotmail.com> wrote:
> Vadim:
>
> This is from the Mastering Spark book:
>
> "It is strongly recommended that a checkpointed RDD is persisted in memory,
> otherwise saving it on a file will require recomputation."

Is this really true? I had the impression that the DAG will not be carried
out once the RDD is serialized to an external file; so does 'saveAsObjectFile'
save the DAG as well?

