spark-user mailing list archives

From snjv <>
Subject [Spark sql]: Re-execution of same operation takes less time than 1st
Date Tue, 03 Apr 2018 05:42:16 GMT

When we execute the same operation twice, the second run takes roughly 40% less
time than the first.
Our operation is like this:
Read 150M rows (spread across multiple parquet files) into a DF.
Read 10M rows (spread across multiple parquet files) into another DF.
Do an intersect operation.

Size of the 150M-row data: 587 MB
Size of the 10M-row data: 50 MB
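The workload above can be sketched roughly as follows. The paths, app name, and the simple timing helper are all hypothetical, not from the original post; only `spark.read.parquet` and `Dataset.intersect` are the actual APIs involved.

```scala
// Minimal sketch of the workload described above (paths are hypothetical).
import org.apache.spark.sql.SparkSession

object IntersectTiming {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("intersect-timing")
      .master("local[*]")
      .getOrCreate()

    // Hypothetical parquet locations standing in for the real input.
    val bigDf   = spark.read.parquet("/data/rows_150m") // ~150M rows, ~587 MB
    val smallDf = spark.read.parquet("/data/rows_10m")  // ~10M rows, ~50 MB

    // Tiny helper to time an action in seconds.
    def timed[T](label: String)(body: => T): T = {
      val t0 = System.nanoTime()
      val result = body
      println(f"$label: ${(System.nanoTime() - t0) / 1e9}%.2f s")
      result
    }

    // Run the same intersect-and-count action twice to compare timings.
    timed("first run")  { bigDf.intersect(smallDf).count() }
    timed("second run") { bigDf.intersect(smallDf).count() }

    spark.stop()
  }
}
```

Note that `intersect` is a distinct set intersection over whole rows, so both reads plus a shuffle happen on each run unless the inputs are explicitly cached with `.cache()`.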

If the first execution takes around 20 sec, the next one takes just 10-12 sec.
Is there a specific reason for this? Is there any optimization we can
utilize to speed up the first execution?



