spark-user mailing list archives

From Jean Georges Perrin <...@jgp.net>
Subject Re: eager? in dataframe's checkpoint
Date Thu, 02 Feb 2017 23:36:44 GMT
I wrote this piece based on all of that; hopefully it will help:
http://jgp.net/2017/02/02/what-are-spark-checkpoints-on-dataframes/

> On Jan 31, 2017, at 4:18 PM, Burak Yavuz <brkyvz@gmail.com> wrote:
> 
> Hi Koert,
> 
> When eager is true, we return a new DataFrame that depends on the files written out to the checkpoint directory.
> All previous operations on the checkpointed DataFrame are dropped from its lineage, so you basically start fresh. AFAIK, when eager is true, the method does not return until the DataFrame is completely checkpointed. If you look at the RDD.checkpoint implementation, the checkpoint location is updated synchronously, so during the count `isCheckpointed` will be true.
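The eager behaviour described in this reply can be sketched with a toy model in plain Python. This is an illustration under stated assumptions, not Spark code: `ToyFrame`, `is_checkpointed`, and the `events` list are hypothetical names, and the non-eager branch assumes that the write is deferred until the first action. The sketch models only the description given here; it does not settle how `RDD.checkpoint` behaves internally.

```python
class ToyFrame:
    """Toy stand-in for a DataFrame; not a Spark API."""

    def __init__(self, compute, events):
        self._compute = compute   # zero-argument function producing the rows
        self._events = events     # shared list recording what happened, in order
        self._marked = False      # True once checkpoint() has been requested
        self._saved = None        # rows "written to the checkpoint directory"

    @property
    def is_checkpointed(self):
        return self._saved is not None

    def checkpoint(self, eager=True):
        # Return a new frame that is marked for checkpointing.
        marked = ToyFrame(self._compute, self._events)
        marked._marked = True
        if eager:
            # Eager: force an action now, so the write finishes before returning.
            marked.count()
        return marked

    def count(self):
        # An "action": reads saved rows if present, otherwise recomputes them.
        if self._saved is not None:
            return len(self._saved)
        rows = self._compute()
        if self._marked:
            self._saved = rows
            self._events.append("checkpoint written")
        return len(rows)


events = []
source = ToyFrame(lambda: [1, 2, 3], events)

eager_df = source.checkpoint(eager=True)
events.append("eager call returned")

lazy_df = source.checkpoint(eager=False)
events.append("lazy call returned")
lazy_before_action = lazy_df.is_checkpointed  # nothing has been written yet
lazy_df.count()                               # first action triggers the write

print(events)
```

In this model the eager call records its write before it returns, while the non-eager call returns first and writes only when `count()` runs.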
> 
> Best,
> Burak
> 
> On Tue, Jan 31, 2017 at 12:52 PM, Koert Kuipers <koert@tresata.com> wrote:
> I understand that checkpoint cuts the lineage, but I am not sure I fully understand the role of eager.
> 
> Eager simply seems to materialize the RDD early with a count, right after the RDD has been marked for checkpointing. But why is that useful? rdd.checkpoint is asynchronous, so when rdd.count runs, rdd.isCheckpointed will most likely be false, and the count will run on the RDD as it was before checkpointing. What is the benefit of that?
> 
> 
> On Thu, Jan 26, 2017 at 11:19 PM, Burak Yavuz <brkyvz@gmail.com> wrote:
> Hi,
> 
> One of the goals of checkpointing is to cut the RDD lineage. Otherwise you can run into a StackOverflowError. If you eagerly checkpoint, you basically cut the lineage there, and the next operations all depend on the checkpointed DataFrame. If you don't checkpoint, you continue to build the lineage, and you may hit the StackOverflowError while that lineage is being resolved.
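The lineage argument above can be illustrated with a toy model in plain Python. This is a sketch under stated assumptions, not Spark code: `Node`, `build`, and `try_compute` are hypothetical names, and Python's `RecursionError` stands in for the JVM stack overflow mentioned in the message. Each transformation adds one parent link, and evaluation walks those links recursively, so a long enough chain exhausts the stack unless a checkpoint materializes the data and drops the parents.

```python
import sys


class Node:
    """Toy lazily evaluated dataset: a parent link plus a per-row function."""

    def __init__(self, parent=None, fn=None, data=None):
        self.parent = parent
        self.fn = fn
        self.data = data  # set only for source nodes and checkpointed nodes

    def map(self, fn):
        # A transformation only extends the lineage; nothing is computed yet.
        return Node(parent=self, fn=fn)

    def compute(self):
        # Resolving the lineage recurses once per ancestor.
        if self.data is not None:
            return self.data
        return [self.fn(x) for x in self.parent.compute()]

    def checkpoint(self):
        # Materialize the rows and return a node with no parents (lineage cut).
        return Node(data=self.compute())

    def depth(self):
        # Number of parent links between this node and materialized data.
        d, node = 0, self
        while node.parent is not None:
            d, node = d + 1, node.parent
        return d


def build(steps, checkpoint_every=None):
    node = Node(data=[0])
    for i in range(1, steps + 1):
        node = node.map(lambda x: x + 1)
        if checkpoint_every and i % checkpoint_every == 0:
            node = node.checkpoint()
    return node


def try_compute(node):
    try:
        return node.compute()
    except RecursionError:
        return "RecursionError"


# Chain length chosen to exceed the interpreter's recursion limit.
STEPS = 5 * sys.getrecursionlimit()

long_lineage = build(STEPS)
cut_lineage = build(STEPS, checkpoint_every=50)

print(long_lineage.depth(), try_compute(long_lineage))
print(cut_lineage.depth(), try_compute(cut_lineage))
```

The uncut chain fails when it is resolved, while the chain that is checkpointed every 50 steps never has more than 50 ancestors to walk.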
> 
> HTH,
> Burak
> 
> On Thu, Jan 26, 2017 at 10:36 AM, Jean Georges Perrin <jgp@jgp.net> wrote:
> Hey Sparkers,
> 
> I am trying to understand the DataFrame's checkpoint (not in the context of streaming): https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/Dataset.html#checkpoint(boolean)
> 
> What is the goal of the eager flag?
> 
> Thanks!
> 
> jg
> 
> 
> 

