spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Dennis Suhari <>
Subject Re: Optimising multiple hive table join and query in spark
Date Sun, 15 Mar 2020 10:50:44 GMT

I am also using Spark on Hive Metastore. The performance is much more better esp. for larger
datasets. I have the feeling that the performance is better if I load the data into dataframes
and do a join instead of doing direct join within SparkSQL. But i can’t explain yet. 

Any body experiences in that ?



Von meinem iPhone gesendet

> Am 15.03.2020 um 06:04 schrieb Manjunath Shetty H <>:
> Hi All,
> We have 10 tables in data warehouse (hdfs/hive) written using ORC format. We are serving
a usecase on top of that by joining 4-5 tables using Hive as of now. But it is not fast as
we wanted it to be, so we are thinking of using spark for this use case.
> Any suggestion on this ? Is it good idea to use the Spark for this use case ? Can we
get better performance by using spark ?
> Any pointers would be helpful.
> Notes: 
> Data is partitioned by date (yyyyMMdd) as integer.
> Query will fetch data for last 7 days from some tables while joining with other tables.
> Approach we thought of as now :
> Create dataframe for each table and partition by same column for all tables ( Lets say
Country as partition column )
> Register all tables as temporary tables 
> Run the sql query with joins
> But the problem we are seeing with this approach is , even though we already partitioned
using country it still does hashParittioning + shuffle during join. All the table join contain
`Country` column with some extra column based on the table.
> Is there any way to avoid these shuffles ? and improve performance ?
> Thanks and regards 
> Manjunath

View raw message