spark-user mailing list archives

From Michael Armbrust <mich...@databricks.com>
Subject Re: Self Join reading the HDFS blocks TWICE
Date Wed, 30 Sep 2015 00:58:21 GMT
You could try caching the table.  This would avoid the double read, but not
the shuffle (at least today with the current optimizer).
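A minimal sketch of that caching approach, assuming a Spark 1.3.x `HiveContext` named `hiveContext` and the `TAB1` table from the query below (the projected column and partition names are taken from the original query; `TAB1_CACHED` is a hypothetical temp-table name):

```scala
// Sketch: materialize the filtered source once so the self join reads
// the HDFS blocks a single time instead of twice.
val base = hiveContext.sql("""
  SELECT KEY1, KEY2, FLD10, FLD11
  FROM TAB1
  WHERE partition_key >= '2015-01-01' AND partition_key < '2015-07-01'""")

base.registerTempTable("TAB1_CACHED")
hiveContext.cacheTable("TAB1_CACHED") // lazily cached; populated on first scan

// Run the self join against TAB1_CACHED instead of TAB1: both sides
// now scan the in-memory copy, though the shuffle for the join and
// aggregations still happens.
```

Note the cache is populated on the first action, so the very first scan still reads from HDFS; only the second side of the join benefits. `uncacheTable("TAB1_CACHED")` frees the memory afterwards.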

On Tue, Sep 29, 2015 at 5:21 PM, Data Science Education <
datasciedu@gmail.com> wrote:

> As part of fairly complex processing, I am executing a self join query
> using HiveContext against a Hive table to find the latest Transaction,
> oldest Transaction, etc. for a given set of Attributes. I am still using
> v1.3.1, so Window functions are not an option. The simplified query
> looks like the one below.
>
> val df = hiveContext.sql("""
>   SELECT TAB1.KEY1, TAB1.KEY2,
>          MAX(CASE WHEN TAB1.FLD10 = TAB2.min_FLD10 THEN TAB1.FLD11
>              ELSE -9999999 END) AS NEW_FLD
>   FROM TAB1
>   INNER JOIN (
>     SELECT KEY1, KEY2, MIN(FLD10) AS min_FLD10
>     FROM TAB1
>     WHERE partition_key >= '2015-01-01' AND partition_key < '2015-07-01'
>     GROUP BY KEY1, KEY2
>   ) TAB2
>     ON TAB1.KEY1 = TAB2.KEY1 AND TAB1.KEY2 = TAB2.KEY2
>   WHERE partition_key >= '2015-01-01' AND partition_key < '2015-07-01'
>   GROUP BY TAB1.KEY1, TAB1.KEY2""")
>
> I see that ~18,000 HDFS blocks are read TWICE and then the Shuffle
> happens. Is there a way to avoid reading the same blocks TWICE during
> the Map Stage? Is there a way to avoid the Shuffle? Thank You.
>
