spark-user mailing list archives

From Theodoros Gkountouvas <theo.gkountou...@futurewei.com>
Subject RE: Caching
Date Mon, 07 Dec 2020 16:54:42 GMT
Hi Amit,

A single action can use the same DataFrame more than once. You can inspect the query plan by
calling DF3.explain (the arguments differ depending on the version of Spark you are using) and
see how many times DF1 or DF2 has to be computed. Given the information you have provided,
I suspect that DF1 is used more than once (once in DF2 and once more in DF3). If you cache DF1,
Spark computes it the first time it is needed and reads it from the cache the second time
instead of recomputing it.
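
To make that concrete, here is a rough sketch of what I mean. It is not your actual job:
the small in-memory tables, the column names (key, value, label, total) and the app name
are all made up so that the snippet runs on its own (for example pasted into spark-shell),
and count() stands in for your save.

```scala
// Minimal sketch; small in-memory tables stand in for the CSV inputs (assumption).
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

val spark = SparkSession.builder()
  .appName("caching-sketch")   // hypothetical app name
  .master("local[*]")          // local mode, for illustration only
  .getOrCreate()
import spark.implicits._

// Stand-in for "val DF1 = read csv"; the column names are made up.
val df1 = Seq(("a", 1), ("a", 2), ("b", 3)).toDF("key", "value")
df1.cache() // lazy: only marks df1 for caching, nothing is computed yet

// DF2 is derived from DF1 ...
val df2 = df1.groupBy("key").agg(sum("value").as("total"))

// ... and DF3 joins both, so DF1 appears twice in DF3's lineage.
val other = Seq(("a", "x"), ("b", "y")).toDF("key", "label")
val df3 = other.join(df1, Seq("key")).join(df2, Seq("key"))

// Print the plans. With the cache in place, the branches that read df1
// should show an in-memory scan instead of the original source.
df3.explain(true)

// The single action (stand-in for DF3.save).
val rows = df3.count()

df1.unpersist() // release the cached data once it is no longer needed
```

If you comment out df1.cache() and compare the two explain outputs, you should see the
source of DF1 scanned in two separate branches of the plan. On Spark 3.x,
df3.explain("formatted") prints a layout that is easier to read.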

I hope this helps,
Theo.

From: Amit Sharma <resolve123@gmail.com>
Sent: Monday, December 7, 2020 11:32 AM
To: user@spark.apache.org
Subject: Caching

Hi All, I am using caching in my code. I have DataFrames like
val DF1 = read csv
val DF2 = DF1.groupBy().agg().select(.....)

val DF3 = read csv .join(DF1).join(DF2)
DF3.save

If I do not cache DF2 or DF1, it takes longer. But I am running only one action, so why
do I need to cache?

Thanks
Amit

