spark-dev mailing list archives

From Tomas Bartalos <>
Subject How to force Spark to honor parquet partitioning
Date Fri, 03 May 2019 16:38:22 GMT

I have partitioned parquet files based on the "event_hour" column.
After reading the parquet files into Spark with
spark.read.format("parquet").load("..."), files from the same parquet
partition are scattered across many Spark partitions.

Example of mapping spark partition -> parquet partition:

Spark partition 1 -> 2019050101, 2019050102, 2019050103
Spark partition 2 -> 2019050101, 2019050103, 2019050104
Spark partition 20 -> 2019050101, ...
Spark partition 21 -> 2019050101, ...

As you can see, parquet partition 2019050101 is present in Spark
partitions 1, 2, 20, and 21.
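
One way to observe this mapping, a sketch assuming the DataFrame from the
read above is bound to df (the column and variable names here are
placeholders), is to pair spark_partition_id() with input_file_name():

  import org.apache.spark.sql.functions.{input_file_name, spark_partition_id}

  // One row per (Spark partition, source file): shows files from the same
  // event_hour directory spread across several Spark partitions
  df.select(spark_partition_id().as("spark_partition"),
            input_file_name().as("source_file"))
    .distinct()
    .orderBy("spark_partition")
    .show(false)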
As a result, when I write out the DataFrame:
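
A sketch of the write, assuming a partitionBy on the same column (the
output path is a placeholder):

  df.write
    .partitionBy("event_hour")
    .format("parquet")
    .save("/path/to/events_out")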

There are many files created in one parquet partition (in our example
it's 4 files, but in reality it's many more).
To speed up queries, my goal is to write 1 file per parquet partition (1
file per hour).

So far my only solution is to use repartition:
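
A sketch of that approach, assuming the repartition is done by the
partition column (the path is again a placeholder):

  import org.apache.spark.sql.functions.col

  // Shuffle all rows of each event_hour into a single Spark partition
  // first, so the partitionBy write produces one file per hour
  df.repartition(col("event_hour"))
    .write
    .partitionBy("event_hour")
    .format("parquet")
    .save("/path/to/events_out")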

But there is a lot of overhead from the unnecessary shuffle. I'd like to
force Spark to "pick up" the existing parquet partitioning.

In my investigation I've found the place in Spark's source where the
initial partitioning happens based on file sizes. There is an explicit
ordering there which causes the parquet partitions to be shuffled across
Spark partitions.
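
For reference, that size-based packing is influenced by these standard
Spark SQL configs (a sketch; the values are illustrative and changing
them does not remove the ordering itself):

  // How file splits are bin-packed into Spark partitions during the scan
  spark.conf.set("spark.sql.files.maxPartitionBytes", "134217728") // target bytes per partition
  spark.conf.set("spark.sql.files.openCostInBytes", "4194304")     // padding cost added per file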

Thank you for your help,
