spark-user mailing list archives

From Vidya Sujeet <sjayatheer...@gmail.com>
Subject Re: Spark SQL, dataframe join questions.
Date Wed, 29 Mar 2017 19:45:54 GMT
In repartition, every element in the partition is moved to a new
partition, doing a full shuffle, compared to the shuffle done by
reduceByKey, which combines values map-side first so less data moves.
With this in mind, repartitioning on the join key can improve your query
performance. ReduceByKey will also shuffle, based on the aggregation.

The best way to decide is to check the query plan of your DataFrame join
query, and drop down to RDD joins only if needed.


On Wed, Mar 29, 2017 at 10:55 AM, Yong Zhang <java8964@hotmail.com> wrote:

> You don't need to repartition your data just for the join. But if
> either side of the join is already partitioned, Spark will use that to
> its advantage as part of join optimization.
>
> Whether you should reduceByKey before the join really depends on your
> join logic. ReduceByKey will shuffle, and the following join COULD
> cause another shuffle, so I am not sure it is a smart move.
>
> Yong
>
> ------------------------------
> *From:* shyla deshpande <deshpandeshyla@gmail.com>
> *Sent:* Wednesday, March 29, 2017 12:33 PM
> *To:* user
> *Subject:* Re: Spark SQL, dataframe join questions.
>
>
>
> On Tue, Mar 28, 2017 at 2:57 PM, shyla deshpande <deshpandeshyla@gmail.com
> > wrote:
>
>> Following are my questions. Thank you.
>>
>> 1. When joining dataframes, is it a good idea to repartition on the
>> key column that is used in the join, or is the optimizer too smart,
>> so forget it?
>>
>> 2. In RDD joins, wherever possible we do reduceByKey before the join
>> to avoid a big shuffle of data. Do we need to do anything similar
>> with dataframe joins, or is the optimizer too smart, so forget it?
>>
>>
>
