spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Koert Kuipers <>
Subject Re: groupBy and then coalesce impacts shuffle partitions in unintended way
Date Wed, 08 Aug 2018 19:47:01 GMT
one small correction: lots of files leads to pressure on the spark driver
program when reading this data in spark.

On Wed, Aug 8, 2018 at 3:39 PM, Koert Kuipers <> wrote:

> hi,
> i am reading data from files into a dataframe, then doing a groupBy for a
> given column with a count, and finally i coalesce to a smaller number of
> partitions before writing out to disk. so roughly:
> coalesce(100).write.format(...).save(...)
> i have this setting: spark.sql.shuffle.partitions=2048
> i expect to see 2048 partitions in shuffle. what i am seeing instead is a
> shuffle with only 100 partitions. it's like the coalesce has taken over the
> partitioning of the groupBy.
> any idea why?
> i am doing coalesce because it is not helpful to write out 2048 files,
> lots of files leads to pressure down the line on executors reading this
> data (i am writing to just one partition of a larger dataset), and since i
> have less than 100 executors i expect it to be efficient. so sounds like a
> good idea, no?
> but i do need 2048 partitions in my shuffle due to the operation i am
> doing in the groupBy (in my real problem i am not just doing a count...).
> thanks!
> koert

View raw message