spark-user mailing list archives

From: snjv <snjv.workm...@gmail.com>
Subject: [Spark sql]: Re-execution of same operation takes less time than 1st
Date: Tue, 03 Apr 2018 05:42:16 GMT
Hi,

When we execute the same operation twice, Spark takes roughly 40% less time on
the second run than on the first.
Our operation is roughly this (a sketch follows below):
Read 150M rows (spread across multiple parquet files) into a DataFrame.
Read 10M rows (spread across multiple parquet files) into another DataFrame.
Intersect the two DataFrames.
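
For reference, this is a minimal sketch of what we run (spark-shell style; the
parquet paths and the timing harness are my own placeholders, not the actual
job):

// Hypothetical paths; each directory holds multiple parquet files.
val largeDf = spark.read.parquet("/data/rows_150m")   // ~150M rows, ~587 MB
val smallDf = spark.read.parquet("/data/rows_10m")    // ~10M rows, ~50 MB

// Intersect the two DataFrames; count() is the action that forces execution.
val t0 = System.nanoTime()
val matched = largeDf.intersect(smallDf)
println(s"rows = ${matched.count()}")
println(s"elapsed = ${(System.nanoTime() - t0) / 1e9} s")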

Size of the 150M-row dataset: 587 MB
Size of the 10M-row dataset: 50 MB

If the first execution takes around 20 sec, the next one takes only 10-12 sec.
Is there a specific reason for this? Is there an optimization we can apply so
that the first execution benefits as well?

Regards
Sanjeev



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/


