spark-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From David Smith <>
Subject Differing performance in self joins
Date Thu, 27 Aug 2015 01:10:43 GMT
I've noticed that two queries, which return identical results, have very
different performance. I'd be interested in any hints about how avoid
problems like this.

The DataFrame df contains a string field "series" and an integer "eday", the
number of days since (or before) the 1970-01-01 epoch.

I'm doing some analysis over a sliding date window and, for now, avoiding
UDAFs. I'm therefore using a self join. First, I create 

val laggard = df.withColumnRenamed("series",
"p_series").withColumnRenamed("eday", "p_eday")

Then, the following query runs in 16s:

df.join(laggard, (df("series") === laggard("p_series")) && (df("eday") ===
(laggard("p_eday") + 1))).count

while the following query runs in 4 - 6 minutes:

df.join(laggard, (df("series") === laggard("p_series")) && ((df("eday") -
laggard("p_eday")) === 1)).count

It's worth noting that the series term is necessary to keep the query from
doing a complete cartesian product over the data.

Ideally, I'd like to look at lags of more than one day, but the following is
equally slow:

df.join(laggard, (df("series") === laggard("p_series")) && (df("eday") -

Any advice about the general principle at work here would be welcome.


View this message in context:
Sent from the Apache Spark Developers List mailing list archive at

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message