spark-issues mailing list archives

From "Hyukjin Kwon (Jira)" <>
Subject [jira] [Commented] (SPARK-31588) merge small files may need more common setting
Date Fri, 08 May 2020 02:57:00 GMT


Hyukjin Kwon commented on SPARK-31588:

Repartitioning does not set a hard limit on the output file size. You should instead control
the block size in HDFS.

> merge small files may need more common setting
> ----------------------------------------------
>                 Key: SPARK-31588
>                 URL:
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.4.5
>         Environment: spark:2.4.5
> hdp:2.7
>            Reporter: philipse
>            Priority: Major
> Hi,
> Spark SQL now allows us to use the REPARTITION or COALESCE hints to manually control small
> files, like the following:
> /*+ REPARTITION(1) */
> /*+ COALESCE(1) */
> But this can only be tuned case by case: we need to decide whether to use COALESCE
> or REPARTITION each time. Can we try a more general way that removes this decision by
> setting a target size, as Hive does?
> *Good points:*
> 1) We will also get the new number of partitions.
> 2) With an on/off parameter provided, users can disable it if needed.
> 3) The parameter can be set at the cluster level instead of on the user side, which makes
> it easier to control small files.
> 4) It greatly reduces the pressure on the NameNode.
> *Not good points:*
> 1) It will add a new task that calculates the target number by collecting statistics on
> the output files.
> I don't know whether this is already planned for the future.
> Thanks
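
The target-size idea in the description above can be sketched in a few lines. This is only an illustration of the arithmetic, not an existing Spark setting or API: the function name `target_partition_count`, its parameters, and the 128 MiB target are all assumptions made for the example.

```python
import math

MIB = 1024 * 1024


def target_partition_count(file_sizes_bytes, target_file_bytes, enabled=True):
    """Hypothetical helper: estimate how many output partitions are needed
    so that each written file is roughly target_file_bytes in size."""
    if not enabled:
        # Models the on/off parameter from the proposal: no merging requested.
        return None
    if target_file_bytes <= 0:
        raise ValueError("target_file_bytes must be positive")
    total_bytes = sum(file_sizes_bytes)
    # Always write at least one partition, even for empty output.
    return max(1, math.ceil(total_bytes / target_file_bytes))


if __name__ == "__main__":
    small_files = [4 * MIB] * 100  # 100 files of 4 MiB each (400 MiB total)
    print(target_partition_count(small_files, 128 * MIB))
```

The resulting number is what a user would otherwise choose by hand for a REPARTITION or COALESCE hint. As the comment at the top notes, this is an estimate and not a hard limit on file size, since data is not necessarily spread evenly across partitions.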
