spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Magnus Nilsson <>
Subject Re: Optimising multiple hive table join and query in spark
Date Sun, 15 Mar 2020 12:01:45 GMT
Been a while but I remember reading on Stack Overflow you can use a UDF as
a join condition to trick catalyst into not reshuffling the partitions, ie
use regular equality on the column you partitioned or bucketed by and your
custom comparer for the other columns. Never got around to try it out
hough. I really would like a native way to tell catalyst not to reshuffle
just because you use more columns in the join condition.

On Sun, Mar 15, 2020 at 6:04 AM Manjunath Shetty H <>

> Hi All,
> We have 10 tables in data warehouse (hdfs/hive) written using ORC format.
> We are serving a usecase on top of that by joining 4-5 tables using Hive as
> of now. But it is not fast as we wanted it to be, so we are thinking of
> using spark for this use case.
> Any suggestion on this ? Is it good idea to use the Spark for this use
> case ? Can we get better performance by using spark ?
> Any pointers would be helpful.
> *Notes*:
>    - Data is partitioned by date (yyyyMMdd) as integer.
>    - Query will fetch data for last 7 days from some tables while joining
>    with other tables.
> *Approach we thought of as now :*
>    - Create dataframe for each table and partition by same column for all
>    tables ( Lets say Country as partition column )
>    - Register all tables as temporary tables
>    - Run the sql query with joins
> But the problem we are seeing with this approach is , even though we
> already partitioned using country it still does hashParittioning +
> shuffle during join. All the table join contain `Country` column with some
> extra column based on the table.
> Is there any way to avoid these shuffles ? and improve performance ?
> Thanks and regards
> Manjunath

View raw message