spark-user mailing list archives

From Amit Sharma <resolve...@gmail.com>
Subject Re: Caching
Date Mon, 07 Dec 2020 17:46:37 GMT
Thanks for the information. I am using Spark 2.3.3. I have a few more
questions.

1. Yes, I am using DF1 twice, but there is only one action at the end, on DF3.
In that case, is DF1 computed just once, or does that depend on how many times
the DataFrame is used in transformations?

I believe that even if a DataFrame is used multiple times in transformations,
the decision to cache should be based on the number of actions. In my case
there is a single action, one save call on DF3. Please correct me if I am wrong.
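
To make the question concrete, here is a minimal sketch of the pipeline, not the actual job. The CSV contents, the column names (key, value, label), the temp paths and the CachingSketch object are all made up for illustration. DF1 feeds two branches of DF3 (the join itself and the aggregate DF2), so the plan printed by explain() can be used to check how DF1 is read in each branch.

```scala
import java.nio.file.{Files, Path}
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.sum

object CachingSketch {
  // Builds DF1..DF3 roughly as described in the thread and runs the single
  // save action. Returns the (hypothetical) output directory.
  def run(): Path = {
    val spark = SparkSession.builder()
      .master("local[*]") // assumption: local run, only for this sketch
      .appName("caching-sketch")
      .getOrCreate()

    // Hypothetical inputs, written here only so the sketch is self-contained.
    val dir = Files.createTempDirectory("caching-sketch")
    val aCsv = dir.resolve("a.csv")
    val bCsv = dir.resolve("b.csv")
    Files.write(aCsv, "key,value\n1,10\n1,20\n2,5\n".getBytes("UTF-8"))
    Files.write(bCsv, "key,label\n1,x\n2,y\n".getBytes("UTF-8"))

    def readCsv(p: Path): DataFrame =
      spark.read.option("header", "true").option("inferSchema", "true").csv(p.toString)

    // DF1 is marked for caching; remove .cache() to compare the two plans.
    val df1 = readCsv(aCsv).cache()
    val df2 = df1.groupBy("key").agg(sum("value").as("total"))
    // DF1 is referenced twice here: directly, and again through DF2.
    val df3 = readCsv(bCsv).join(df1, Seq("key")).join(df2, Seq("key"))

    // Prints the physical plan; look at how each use of DF1 is scanned.
    df3.explain()

    // The only action in the pipeline.
    val out = dir.resolve("out")
    df3.write.mode("overwrite").parquet(out.toString)

    df1.unpersist()
    spark.stop()
    out
  }

  def main(args: Array[String]): Unit = println(run())
}
```

With .cache() in place, the plan would typically show an in-memory scan for both uses of DF1. Without it, the same CSV file is usually scanned once per branch, even though there is only one action.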

Thanks
Amit

On Mon, Dec 7, 2020 at 11:54 AM Theodoros Gkountouvas <
theo.gkountouvas@futurewei.com> wrote:

> Hi Amit,
>
>
>
> One action might use the same DataFrame more than once. You can look at
> the query plan by calling DF3.explain (the arguments differ depending on
> the version of Spark you are using) and see how many times DF2 or DF1
> needs to be computed. Given the information you have provided, I suspect
> that DF1 is used more than once (once in DF2 and again in DF3). So, when
> DF1 is cached, Spark computes and caches it the first time and loads it
> from the cache the second time instead of running it again.
>
>
>
> I hope this helped,
>
> Theo.
>
>
>
> *From:* Amit Sharma <resolve123@gmail.com>
> *Sent:* Monday, December 7, 2020 11:32 AM
> *To:* user@spark.apache.org
> *Subject:* Caching
>
>
>
> Hi All, I am using caching in my code. I have a DF like
>
> val DF1 = read csv
>
> val DF2 = DF1.groupBy().agg().select(.....)
>
>
>
> val DF3 = read csv.join(DF1).join(DF2)
>
> DF3.save
>
>
>
> If I do not cache DF2 or DF1, the job takes longer. But I am performing
> only one action, so why do I need to cache?
>
>
>
> Thanks
>
> Amit
>
>
>
>
>
