spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Takeshi Yamamuro <>
Subject Re: Optimal way to re-partition from a single partition
Date Tue, 09 Feb 2016 15:39:42 GMT

DataFrame#sort() uses `RangePartitioning` in `Exchange` instead of
`RangePartitioning` roughly samples input data and internally computes
partition bounds
to split given rows into `spark.sql.shuffle.partitions` partitions.
Therefore, when sort keys are highly skewed, I think some partitions could
end up being empty
(that is, # of result partitions is lower than `spark.sql.shuffle.partitions`

On Tue, Feb 9, 2016 at 9:35 PM, Hemant Bhanawat <>

> For sql shuffle operations like groupby, the number of output partitions
> is controlled by spark.sql.shuffle.partitions. But, it seems orderBy does
> not honour this.
> In my small test, I could see that the number of partitions  in DF
> returned by orderBy was equal to the total number of distinct keys. Are you
> observing the same, I mean do you have a single value for all rows in the
> column on which you are running orderBy? If yes, you are better off not
> running the orderBy clause.
> May be someone from spark sql team could answer that how should the
> partitioning of the output DF be handled when doing an orderBy?
> Hemant
> On Tue, Feb 9, 2016 at 4:00 AM, Cesar Flores <> wrote:
>> I have a data frame which I sort using orderBy function. This operation
>> causes my data frame to go to a single partition. After using those
>> results, I would like to re-partition to a larger number of partitions.
>> Currently I am just doing:
>> val rdd = df.rdd.coalesce(100, true) //df is a dataframe with a single
>> partition and around 14 million records
>> val newDF =  hc.createDataFrame(rdd, df.schema)
>> This process is really slow. Is there any other way of achieving this
>> task, or to optimize it (perhaps tweaking a spark configuration parameter)?
>> Thanks a lot
>> --
>> Cesar Flores

Takeshi Yamamuro

View raw message