spark-user mailing list archives

From Chris Teoh <>
Subject Re: Attempting to avoid a shuffle on join
Date Sat, 06 Jul 2019 06:37:03 GMT
DataFrames have an equivalent: repartition(col) hash partitions by the given column, much like the RDD partitionBy. (DataFrameWriter.partitionBy is different -- it only controls output file layout, not in-memory partitioning.)
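To illustrate, here's a minimal sketch of that in the Dataset API. The names df1, df2, and the column "key" are placeholders for your own datasets and join column; note that repartition is itself a shuffle, but later joins or aggregations on the same column within the job can reuse that partitioning instead of shuffling again:

```scala
import org.apache.spark.sql.functions.col

// Hash partition both sides on the join column so matching keys land in
// matching partitions.
val left  = df1.repartition(col("key"))
val right = df2.repartition(col("key"))

// Spark's planner can now join co-partitioned data without a further
// exchange of either side.
val joined = left.join(right, "key")
```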

You can avoid a shuffle entirely if one of your datasets is small enough to broadcast to every executor (a broadcast hash join).
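A sketch of the broadcast approach, assuming hypothetical Datasets large and small joined on a column "key". Spark also does this automatically for tables below spark.sql.autoBroadcastJoinThreshold (10 MB by default):

```scala
import org.apache.spark.sql.functions.broadcast

// The broadcast hint ships the small side to every executor, so the large
// side is joined in place -- neither dataset is shuffled.
val joined = large.join(broadcast(small), "key")
```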

On Thu., 4 Jul. 2019, 7:34 am Mkal, <> wrote:

> Please keep in mind I'm fairly new to Spark.
> I have some Spark code where I load two text files as Datasets and, after
> some map and filter operations to bring the columns into a specific shape,
> I join the Datasets.
> The join takes place on a common column (of type string).
> Is there any way to avoid the exchange/shuffle before the join?
> As I understand it, the idea is that if I initially hash partition the
> datasets based on the join column, then the join would only have to look
> within matching partitions to complete, thus avoiding a shuffle.
> In the RDD API, you can create a hash partitioner and use partitionBy when
> creating the RDDs (though I'm not sure this is a sure way to avoid the
> shuffle on the join). Is there any similar method for the DataFrame/Dataset
> API?
> I would also like to avoid repartition, repartitionByRange, and bucketing
> techniques, since I only intend to do one join and these also require
> shuffling beforehand.
