spark-user mailing list archives

From "Lalwani, Jayesh" <jlalw...@amazon.com.INVALID>
Subject Re: Caching
Date Mon, 07 Dec 2020 19:28:24 GMT
> Jayesh, but during logical planning Spark would know that the same DF is used twice, so it
> will optimize the query.

No. That would mean that Spark would need to cache DF1. Spark won’t cache dataframes unless
you ask it to, even if it knows that the same dataframe is being used twice. This is because
caching dataframes introduces memory overhead, and Spark is not going to do it prematurely.
It will combine the processing of various dataframes within a stage. However, in your case, you
are doing an aggregation, which will create a new stage.

You can check the execution plan if you like.
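
One way to check this from code is to compare the physical plan before and after an explicit cache() call. The following is a minimal sketch, not the poster's actual job: the in-memory sample rows, the column names "key" and "value", and the object name PlanCheck are all assumptions made for illustration. It relies on the physical plan text containing an InMemoryTableScan node when a cached dataframe is used, which is how recent Spark versions print it.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object PlanCheck {
  // Returns the physical plan text of a DF1 -> DF2 -> DF3 pipeline,
  // first without and then with an explicit cache on DF1.
  def plans(spark: SparkSession): (String, String) = {
    import spark.implicits._

    // Hypothetical stand-in for DF1; the thread loads it from a CSV file.
    val df1 = Seq(("a", 1), ("a", 2), ("b", 3)).toDF("key", "value")

    // Rebuilt on each call so the plan is derived after any cache() call.
    def df3(): DataFrame = {
      val df2 = df1.groupBy("key").sum("value")
      df1.join(df2, "key")
    }

    val before = df3().queryExecution.executedPlan.toString
    df1.cache() // lazy: only marks DF1 for caching, nothing is computed yet
    val after = df3().queryExecution.executedPlan.toString
    df1.unpersist()
    (before, after)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("plan-check").getOrCreate()
    val (before, after) = plans(spark)
    println(before)
    println(after)
    println(s"cached scan before: ${before.contains("InMemoryTableScan")}")
    println(s"cached scan after:  ${after.contains("InMemoryTableScan")}")
    spark.stop()
  }
}
```

In the uncached plan the source appears once per use of DF1; in the cached plan those branches should read from the in-memory relation instead.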

From: Amit Sharma <resolve123@gmail.com>
Reply-To: "resolve123@gmail.com" <resolve123@gmail.com>
Date: Monday, December 7, 2020 at 1:47 PM
To: "Lalwani, Jayesh" <jlalwani@amazon.com>, "user@spark.apache.org" <user@spark.apache.org>
Subject: RE: [EXTERNAL] Caching


CAUTION: This email originated from outside of the organization. Do not click links or open
attachments unless you can confirm the sender and know the content is safe.


Jayesh, but during logical planning Spark would know that the same DF is used twice, so it will
optimize the query.


Thanks
Amit

On Mon, Dec 7, 2020 at 1:16 PM Lalwani, Jayesh <jlalwani@amazon.com> wrote:
Since DF2 depends on DF1, and DF3 depends on both DF1 and DF2, without caching
Spark will read the CSV twice: once to load it for DF1, and once to load it for DF2. When
you add a cache on DF1 or DF2, it reads the CSV only once.
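
A minimal sketch of that caching pattern follows. It is a simplification under stated assumptions: the column names "key" and "value", the Parquet output format, the temporary paths, and the names CacheOnce, run and demo are hypothetical, and the second CSV that DF3 reads in the original question is left out.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, sum}

object CacheOnce {
  // Runs the DF1 -> DF2 -> DF3 pipeline with DF1 cached, writes DF3,
  // and returns the number of rows written.
  def run(spark: SparkSession, csvPath: String, outPath: String): Long = {
    // cache() is lazy; DF1 is materialized by the first action that needs it.
    val df1 = spark.read.option("header", "true").csv(csvPath).cache()
    val df2 = df1.groupBy("key").agg(sum(col("value").cast("long")).as("total"))
    val df3 = df1.join(df2, "key")
    try {
      df3.write.mode("overwrite").parquet(outPath) // the single action
      spark.read.parquet(outPath).count()
    } finally {
      df1.unpersist() // release the cached blocks once the job is done
    }
  }

  // Creates a tiny hypothetical input file and runs the pipeline on it.
  def demo(spark: SparkSession): Long = {
    val dir = Files.createTempDirectory("cache-demo")
    val csv = dir.resolve("in.csv")
    Files.write(csv, "key,value\na,1\na,2\nb,3\n".getBytes("UTF-8"))
    run(spark, csv.toString, dir.resolve("out").toString)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("cache-once").getOrCreate()
    println(s"rows written: ${demo(spark)}")
    spark.stop()
  }
}
```

Calling unpersist() after the write keeps the cached data from occupying executor memory longer than the job needs it.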

You might want to look at doing a windowed query on DF1 to avoid joining DF1 with DF2. This
should give you better or similar performance compared to caching, because Spark will
effectively cache the data during the shuffle.
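
As a rough sketch of the windowed alternative, the aggregate can be attached to every row of DF1 with a window function, so no separate DF2 and no join back to DF1 are needed. The column names "key" and "value", the sum aggregate, and the helper name withGroupTotal are assumptions; the original question does not say which aggregation is used.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.sum

object WindowInsteadOfJoin {
  // Adds a per-key total to each row of df1 without building DF2 or joining.
  def withGroupTotal(df1: DataFrame): DataFrame = {
    val byKey = Window.partitionBy("key")
    df1.withColumn("total", sum("value").over(byKey))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("window-demo").getOrCreate()
    import spark.implicits._

    // Hypothetical stand-in for DF1.
    val df1 = Seq(("a", 1), ("a", 2), ("b", 3)).toDF("key", "value")
    withGroupTotal(df1).show()
    spark.stop()
  }
}
```

This only replaces the join when DF2's aggregate is wanted alongside the original rows of DF1; it reads the source once and shuffles it once by key.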

From: Amit Sharma <resolve123@gmail.com>
Reply-To: "resolve123@gmail.com" <resolve123@gmail.com>
Date: Monday, December 7, 2020 at 12:47 PM
To: Theodoros Gkountouvas <theo.gkountouvas@futurewei.com>, "user@spark.apache.org" <user@spark.apache.org>
Subject: RE: [EXTERNAL] Caching


Thanks for the information. I am using Spark 2.3.3. I have a few more questions.

1. Yes, I am using DF1 twice, but at the end there is only one action, on DF3. In that case, should
DF1 be computed just once, or does it depend on how many times the dataframe is used in transformations?

I believe that even if we use a dataframe multiple times in transformations, the need for caching should
be based on actions. In my case there is one action, a single save call on DF3. Please correct me if I am
wrong.

Thanks
Amit

On Mon, Dec 7, 2020 at 11:54 AM Theodoros Gkountouvas <theo.gkountouvas@futurewei.com> wrote:
Hi Amit,

One action might use the same DataFrame more than once. You can look at your LogicalPlan by
executing DF3.explain (the arguments differ depending on the version of Spark you are using) and
see how many times you need to compute DF2 or DF1. Given the information you have provided,
I suspect that DF1 is used more than once (once in DF2 and once in DF3). So, Spark
is going to cache it the first time and it will load it from cache instead of running it again
the second time.
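
For reference, a small sketch of the explain variants. The sample data, the column names, and the names ExplainDemo and planText are hypothetical. explain() and explain(true) are available in Spark 2.x; the string-mode overload is mentioned only in a comment because it was added in Spark 3.0.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object ExplainDemo {
  // Full plan text (parsed, analyzed, optimized, physical) as a String,
  // which is easier to search or log than the printed explain output.
  def planText(df: DataFrame): String = df.queryExecution.toString

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("explain-demo").getOrCreate()
    import spark.implicits._

    // Hypothetical stand-ins for DF1, DF2 and DF3.
    val df1 = Seq(("a", 1), ("a", 2), ("b", 3)).toDF("key", "value")
    val df2 = df1.groupBy("key").sum("value")
    val df3 = df1.join(df2, "key")

    df3.explain()     // physical plan only
    df3.explain(true) // logical plans followed by the physical plan
    // Spark 3.0 and later also accept a mode name, e.g. df3.explain("formatted").
    // That overload does not exist in 2.3.3, so it is left as a comment here.

    println(planText(df3))
    spark.stop()
  }
}
```

Counting how often the source scan appears in the physical plan shows how many times DF1 would be computed for the single action.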

I hope this helped,
Theo.

From: Amit Sharma <resolve123@gmail.com>
Sent: Monday, December 7, 2020 11:32 AM
To: user@spark.apache.org
Subject: Caching

Hi All, I am using caching in my code. I have DataFrames like:
val DF1 = read csv
val DF2 = DF1.groupBy().agg().select(.....)

val DF3 = read csv .join(DF1).join(DF2)
DF3.save

If I do not cache DF2 or DF1, it takes longer. But I am performing only one action, so why
do I need to cache?

Thanks
Amit

