spark-user mailing list archives

From Data Science Education <datasci...@gmail.com>
Subject Self Join reading the HDFS blocks TWICE
Date Wed, 30 Sep 2015 00:21:48 GMT
As part of some fairly complex processing, I am executing a self-join query
using HiveContext against a Hive table to find the latest transaction,
oldest transaction, etc. for a given set of attributes. I am still on
v1.3.1, so window functions are not an option. The simplified query
looks like this:

val df = hiveContext.sql("""
  SELECT TAB1.KEY1, TAB1.KEY2,
         MAX(CASE WHEN (TAB1.FLD10 = TAB2.min_FLD10) THEN TAB1.FLD11
             ELSE -9999999 END) AS NEW_FLD
  FROM TAB1
  INNER JOIN (SELECT KEY1, KEY2, MIN(FLD10) AS min_FLD10
              FROM TAB1
              WHERE partition_key >= '2015-01-01' AND partition_key < '2015-07-01'
              GROUP BY KEY1, KEY2) TAB2
    ON TAB1.KEY1 = TAB2.KEY1 AND TAB1.KEY2 = TAB2.KEY2
  WHERE partition_key >= '2015-01-01' AND partition_key < '2015-07-01'
  GROUP BY TAB1.KEY1, TAB1.KEY2""")

I see that ~18,000 HDFS blocks are read TWICE before the shuffle happens.
Is there a way to avoid reading the same blocks twice during the map
stage? Is there a way to avoid the shuffle? Thank you.
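A common workaround on Spark 1.3.x (before window functions) is to materialize the filtered slice of the table once, cache it, and run both sides of the self-join against the cached temp table, so HDFS is scanned only a single time. The sketch below is a hypothetical rewrite of the query above under that approach; the table and column names are taken from the post, but the temp-table name `TAB1_CACHED` is made up for illustration, and the shuffle for the GROUP BY/join still happens:

```scala
// Sketch only: read and filter TAB1 once, cache it, then self-join the
// cached copy. Assumes a HiveContext named hiveContext is in scope.
val base = hiveContext.sql("""
  SELECT KEY1, KEY2, FLD10, FLD11
  FROM TAB1
  WHERE partition_key >= '2015-01-01' AND partition_key < '2015-07-01'""")

base.cache()                           // keep the filtered rows in memory
base.registerTempTable("TAB1_CACHED")  // hypothetical temp-table name

val df = hiveContext.sql("""
  SELECT T1.KEY1, T1.KEY2,
         MAX(CASE WHEN (T1.FLD10 = T2.min_FLD10) THEN T1.FLD11
             ELSE -9999999 END) AS NEW_FLD
  FROM TAB1_CACHED T1
  INNER JOIN (SELECT KEY1, KEY2, MIN(FLD10) AS min_FLD10
              FROM TAB1_CACHED
              GROUP BY KEY1, KEY2) T2
    ON T1.KEY1 = T2.KEY1 AND T1.KEY2 = T2.KEY2
  GROUP BY T1.KEY1, T1.KEY2""")
```

Note that the first action on `df` will populate the cache, and the cache only pays off if the filtered slice fits in memory (otherwise `persist(StorageLevel.MEMORY_AND_DISK)` may be a safer choice); whether this beats two HDFS scans depends on cluster memory.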
