spark-issues mailing list archives

From "Hyukjin Kwon (Jira)" <>
Subject [jira] [Commented] (SPARK-31588) merge small files may need more common setting
Date Sun, 10 May 2020 05:26:00 GMT


Hyukjin Kwon commented on SPARK-31588:

I don't fully see the connection between physical size and repartitioning. The repartition API
isn't based on physical size. Even if you want to make your file sizes even, what if the
data is already evenly distributed? A shuffle is very expensive, and I don't believe adding a
general configuration makes much sense here.

> merge small files may need more common setting
> ----------------------------------------------
>                 Key: SPARK-31588
>                 URL:
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.4.5
>         Environment: spark:2.4.5
> hdp:2.7
>            Reporter: philipse
>            Priority: Major
> Hi,
> Spark SQL now allows us to use REPARTITION or COALESCE hints to manually control small
> files, like the following:
> /*+ REPARTITION(1) */
> /*+ COALESCE(1) */
> But this can only be tuned case by case: each time, we need to decide whether to use COALESCE
> or REPARTITION. Can we try a more common way that removes this decision by setting a target
> size, as Hive does?
> *Good points:*
> 1) We will also get the new number of partitions
> 2) With an ON/OFF parameter provided, users can turn it off if needed
> 3) The parameter can be set at the cluster level instead of on the user side, so it will be
> easier to control small files
> 4) It greatly reduces the pressure on the NameNode
> *Not good points:*
> 1) It will add a new task to calculate the target number of partitions from statistics on
> the output files.
> I don't know whether this is already planned for the future.
> Thanks
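
For readers following the thread, the "target size" idea in the issue can be sketched roughly as below. This is only an illustration, not anything Spark provides: the function names `target_partitions` and `hinted_select`, the table name `src`, and the 128 MiB target are all assumptions made for the example. It also assumes the total output size in bytes is already known, which is exactly the extra statistics step the reporter lists as a drawback, and it does not address the point raised in the comment that repartitioning is not driven by physical size.

```python
def target_partitions(total_bytes: int, target_file_bytes: int) -> int:
    """Hypothetical helper: how many files give roughly target_file_bytes each."""
    if target_file_bytes <= 0:
        raise ValueError("target_file_bytes must be positive")
    # Integer ceiling division; always return at least one partition.
    return max(1, (total_bytes + target_file_bytes - 1) // target_file_bytes)


def hinted_select(table: str, num_partitions: int, shuffle: bool = True) -> str:
    """Hypothetical helper: build a SELECT with the hint syntax quoted in the issue.

    REPARTITION triggers a shuffle; COALESCE only merges existing partitions.
    """
    hint = "REPARTITION" if shuffle else "COALESCE"
    return f"SELECT /*+ {hint}({num_partitions}) */ * FROM {table}"


# Assumed numbers: about 1 GB of output and a 128 MiB target file size.
n = target_partitions(total_bytes=1_000_000_000, target_file_bytes=128 * 1024 * 1024)
print(hinted_select("src", n))
```

The choice between the two hints is still left to the caller here (the `shuffle` flag), which is the per-query decision the issue hopes a common setting could remove.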
