spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Manjunath Shetty H <>
Subject Re: Optimising multiple hive table join and query in spark
Date Sun, 15 Mar 2020 16:07:02 GMT
Only partitioned and Join keys are not sorted coz those are written incrementally with batch
From: Georg Heiler <>
Sent: Sunday, March 15, 2020 8:30:53 PM
To: Manjunath Shetty H <>
Cc: ayan guha <>; Magnus Nilsson <>; user <>
Subject: Re: Optimising multiple hive table join and query in spark

Did you only partition or also bucket by the join column? Are ORCI indices active i.e. the
JOIN keys sorted when writing the files?


Am So., 15. März 2020 um 15:52 Uhr schrieb Manjunath Shetty H <<>>:
Mostly the concern is the reshuffle. Even though all the DF's are partitioned by same column.
During join it does reshuffle, that is the bottleneck as of now in our POC implementation.

Is there any way tell spark that keep all partitions with same partition key at the same place
so that during the join it wont do shuffle again.

From: ayan guha <<>>
Sent: Sunday, March 15, 2020 5:46 PM
To: Magnus Nilsson <<>>
Cc: user <<>>
Subject: Re: Optimising multiple hive table join and query in spark


I would first and foremost try to identify where is the most time spend during the query.
One possibility is it just takes ramp up time for executors to be available, if thats the
case then maybe a dedicated yarn queue may help, or using Spark thriftserver may help.

On Sun, Mar 15, 2020 at 11:02 PM Magnus Nilsson <<>>
Been a while but I remember reading on Stack Overflow you can use a UDF as a join condition
to trick catalyst into not reshuffling the partitions, ie use regular equality on the column
you partitioned or bucketed by and your custom comparer for the other columns. Never got around
to try it out hough. I really would like a native way to tell catalyst not to reshuffle just
because you use more columns in the join condition.

On Sun, Mar 15, 2020 at 6:04 AM Manjunath Shetty H <<>>
Hi All,

We have 10 tables in data warehouse (hdfs/hive) written using ORC format. We are serving a
usecase on top of that by joining 4-5 tables using Hive as of now. But it is not fast as we
wanted it to be, so we are thinking of using spark for this use case.

Any suggestion on this ? Is it good idea to use the Spark for this use case ? Can we get better
performance by using spark ?

Any pointers would be helpful.


  *   Data is partitioned by date (yyyyMMdd) as integer.
  *   Query will fetch data for last 7 days from some tables while joining with other tables.

Approach we thought of as now :

  *   Create dataframe for each table and partition by same column for all tables ( Lets say
Country as partition column )
  *   Register all tables as temporary tables
  *   Run the sql query with joins

But the problem we are seeing with this approach is , even though we already partitioned using
country it still does hashParittioning + shuffle during join. All the table join contain `Country`
column with some extra column based on the table.

Is there any way to avoid these shuffles ? and improve performance ?

Thanks and regards

Best Regards,
Ayan Guha

View raw message