spark-user mailing list archives

From naresh Goud <nareshgoud.du...@gmail.com>
Subject Re: [Spark sql]: Re-execution of same operation takes less time than 1st
Date Tue, 03 Apr 2018 16:09:25 GMT
Whenever Spark reads data, it keeps it in executor memory until there is
no room left for new data being read or processed. This is the beauty of
Spark.
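
A minimal sketch of the scenario from the question, assuming PySpark: it asks Spark explicitly to keep both DataFrames in executor memory with `cache()` and times the same intersect twice. The function names `timed` and `run_intersect_twice` and the two path arguments are hypothetical, not taken from the original job, and the timings it returns depend entirely on the cluster and data.

```python
import time


def timed(action):
    """Run a zero-argument callable and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = action()
    return result, time.perf_counter() - start


def run_intersect_twice(spark, big_path, small_path):
    """Time the same intersect twice on explicitly cached DataFrames.

    `spark` is an existing SparkSession; `big_path` and `small_path` are
    hypothetical locations of the two parquet datasets.
    """
    # cache() only marks the DataFrames for caching; nothing is read yet.
    big = spark.read.parquet(big_path).cache()
    small = spark.read.parquet(small_path).cache()

    # The first action reads the parquet files and fills the cache.
    _, first = timed(lambda: big.intersect(small).count())
    # The second action can reuse the cached partitions.
    _, second = timed(lambda: big.intersect(small).count())
    return first, second
```

Calling `run_intersect_twice(spark, ...)` with a real `SparkSession` returns the two elapsed times in seconds, so the first and second runs can be compared directly.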


On Tue, Apr 3, 2018 at 12:42 AM snjv <snjv.workmail@gmail.com> wrote:

> Hi,
>
> When we execute the same operation twice, Spark takes less time (~40%)
> the second time than the first.
> Our operation is like this:
> Read 150M rows (spread across multiple parquet files) into a DF.
> Read 10M rows (spread across multiple parquet files) into another DF.
> Do an intersect operation.
>
> Size of 150M row file: 587MB
> Size of 10M row file: 50MB
>
> If the first execution takes around 20 sec, the next one takes just 10-12
> sec.
> Any specific reason for this? Is there any optimization that we can
> utilize during the first operation?
>
> Regards
> Sanjeev
> --
Thanks,
Naresh
www.linkedin.com/in/naresh-dulam
http://hadoopandspark.blogspot.com/
