spark-user mailing list archives

From Amit Sharma <resolve...@gmail.com>
Subject Re: Caching
Date Mon, 07 Dec 2020 18:46:47 GMT
Jayesh, but during logical planning Spark would know that the same DF is used
twice, so it should optimize the query.


Thanks
Amit

On Mon, Dec 7, 2020 at 1:16 PM Lalwani, Jayesh <jlalwani@amazon.com> wrote:

> Since DF2 is dependent on DF1, and DF3 is dependent on both DF1 and DF2,
> without caching Spark will read the CSV twice: once to load it for DF1,
> and once to load it for DF2. When you add a cache on DF1 or DF2, it reads
> the CSV only once.
>
>
>
> You might want to look at doing a windowed query on DF1 to avoid joining
> DF1 with DF2. This should give you performance similar to or better than
> caching, because Spark effectively caches the shuffled data and can reuse
> it.
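>
> A rough sketch of that rewrite (the column names key and value and the max
> aggregate are hypothetical stand-ins, since the actual schema wasn't
> shared):
>
> ```scala
> import org.apache.spark.sql.expressions.Window
> import org.apache.spark.sql.functions.{col, max}
>
> // Hypothetical: suppose DF2 aggregates DF1 by "key". Instead of joining
> // the aggregate back onto DF1, compute it as a window function over DF1
> // directly, so DF1 is scanned once and no self-join is needed.
> val w = Window.partitionBy("key")
> val df3 = df1.withColumn("maxValue", max(col("value")).over(w))
> ```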
>
>
>
> *From: *Amit Sharma <resolve123@gmail.com>
> *Reply-To: *"resolve123@gmail.com" <resolve123@gmail.com>
> *Date: *Monday, December 7, 2020 at 12:47 PM
> *To: *Theodoros Gkountouvas <theo.gkountouvas@futurewei.com>, "
> user@spark.apache.org" <user@spark.apache.org>
> *Subject: *RE: [EXTERNAL] Caching
>
>
>
>
>
>
> Thanks for the information. I am using Spark 2.3.3. I have a few more
> questions:
>
>
>
> 1. Yes, I am using DF1 twice, but in the end there is only one action, on
> DF3. In that case, is DF1 computed just once, or does it depend on how many
> times the dataframe is used in transformations?
>
>
>
> I believe that even if we use a dataframe multiple times in transformations,
> the decision to cache should be based on actions. In my case there is one
> action, the save call on DF3. Please correct me if I am wrong.
>
>
>
> Thanks
>
> Amit
>
>
>
> On Mon, Dec 7, 2020 at 11:54 AM Theodoros Gkountouvas <
> theo.gkountouvas@futurewei.com> wrote:
>
> Hi Amit,
>
>
>
> One action might use the same DataFrame more than once. You can look at
> your LogicalPlan by executing DF3.explain (the arguments differ depending
> on the version of Spark you are using) and see how many times you need to
> compute DF2 or DF1. Given the information you have provided, I suspect that
> DF1 is used more than once (once in DF2 and once more in DF3). So, if you
> cache it, Spark is going to compute it the first time and load it from the
> cache instead of recomputing it the second time.
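>
> For example (a sketch; the exact explain arguments vary by Spark version,
> and in 2.3.x it takes an optional boolean):
>
> ```scala
> // Print the physical plan; with the boolean flag, 2.3.x also prints the
> // parsed, analyzed, and optimized logical plans. Counting how many times
> // the CSV scan for DF1 appears shows how often it would be recomputed.
> DF3.explain(true)
> ```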
>
>
>
> I hope this helped,
>
> Theo.
>
>
>
> *From:* Amit Sharma <resolve123@gmail.com>
> *Sent:* Monday, December 7, 2020 11:32 AM
> *To:* user@spark.apache.org
> *Subject:* Caching
>
>
>
> Hi All, I am using caching in my code. I have a DF like
>
> val DF1 = read csv
>
> val DF2 = DF1.groupBy().agg().select(.....)
>
>
>
> val DF3 = read csv.join(DF1).join(DF2)
>
> DF3.save
>
>
>
> If I do not cache DF1 or DF2, it takes longer. But I am performing only one
> action, so why do I need to cache?
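>
> A minimal sketch of what adding a cache could look like here (the paths,
> the "key" join column, and the aggregate are hypothetical, since the real
> ones are elided above):
>
> ```scala
> import org.apache.spark.sql.functions.max
>
> // Hypothetical paths and columns; the originals were not shown.
> // cache() marks DF1 for caching: it is materialized on first use and
> // served from memory when the plan references it again.
> val df1 = spark.read.option("header", "true").csv("/path/to/first.csv").cache()
> val df2 = df1.groupBy("key").agg(max("value").as("maxValue"))
> val df3 = spark.read.csv("/path/to/second.csv").join(df1, "key").join(df2, "key")
> df3.write.save("/path/to/output")
> ```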
>
>
>
> Thanks
>
> Amit
>
>
>
>
>
>
