spark-user mailing list archives

From "Mendelson, Assaf" <>
Subject RE: About transformations
Date Fri, 09 Dec 2016 12:56:54 GMT
This is a guess, but I would bet that most of the time went into loading the data. The
second time, there are many places where the data could be cached (either by Spark or even
by the OS if you are reading from a file).
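
To make the guess concrete, here is a toy model in plain Python (not Spark, and not how Spark is implemented). ToySource, LazyFrame, and the cost units 10 and 1 are hypothetical and chosen only to mirror the timings reported below. It sketches how both counts can re-run the full LOAD > SELECT ALL lineage while the second run is still cheap, because the underlying read is served from a warm cache:

```python
class ToySource:
    """Hypothetical data source: the first read is cold (slow), later reads are warm."""

    def __init__(self, rows, cold_cost=10, warm_cost=1):
        self.rows = list(rows)
        self.cold_cost = cold_cost  # assumed cost units for an uncached read
        self.warm_cost = warm_cost  # assumed cost units once a cache (e.g. OS page cache) is warm
        self.reads = 0
        self.cost_log = []

    def read(self):
        # Charge the cold cost only on the very first read.
        cost = self.cold_cost if self.reads == 0 else self.warm_cost
        self.reads += 1
        self.cost_log.append(cost)
        return list(self.rows)


class LazyFrame:
    """Minimal lazy pipeline: transformations build a recipe, actions execute it."""

    def __init__(self, compute):
        self._compute = compute

    def select_all(self):
        # Transformation: returns a new lazy frame and reads nothing yet.
        return LazyFrame(lambda: [row for row in self._compute()])

    def count(self):
        # Action: re-executes the whole lineage on every call.
        return len(self._compute())


source = ToySource(rows=range(1000))
df_1 = LazyFrame(source.read)  # LOAD (lazy)
df_2 = df_1.select_all()       # SELECT ALL (lazy)

first = df_2.count()           # first action: cold read
second = df_2.count()          # second action: lineage re-executed, warm read

print("reads:", source.reads)
print("cost per action:", source.cost_log)
print("total cost:", sum(source.cost_log))
```

In this model the source is read twice, once per action, so the transformations really are re-executed; only the cost of the second read drops. To check whether something similar is happening in a real job, timing each action separately would show how the total splits between the two counts.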

-----Original Message-----
From: brccosta [] 
Sent: Friday, December 09, 2016 1:24 PM
Subject: About transformations

Dear guys,

We're running some tests to evaluate the behavior of transformations and actions in Spark
with Spark SQL. In our tests, we first built a simple dataflow with 2 transformations and
1 action:

LOAD (result: df_1) > SELECT ALL FROM df_1 (result: df_2) > COUNT(df_2)

The execution time for this first dataflow was 10 seconds. Next, we added another action to
our dataflow:

LOAD (result: df_1) > SELECT ALL FROM df_1 (result: df_2) > COUNT(df_2) >

Analyzing the second version of the dataflow: since all transformations are lazy and re-executed
for each action (according to the documentation), executing the second count should
require re-executing the two previous transformations (LOAD and SELECT ALL). Thus, we
expected the second version of our dataflow to take around 20 seconds. However, the
execution time was 11 seconds. Apparently, the results of the transformations
required by the first count were cached by Spark for the second count.

Please, do you guys know what is happening? 
