spark-issues mailing list archives

From "Hyukjin Kwon (Jira)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-31588) merge small files may need more common setting
Date Fri, 08 May 2020 02:57:00 GMT

    [ https://issues.apache.org/jira/browse/SPARK-31588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17102195#comment-17102195 ]

Hyukjin Kwon commented on SPARK-31588:
--------------------------------------

Repartitioning does not set a hard limit on the file size. You should instead control the
block size in HDFS.

> merge small files may need more common setting
> ----------------------------------------------
>
>                 Key: SPARK-31588
>                 URL: https://issues.apache.org/jira/browse/SPARK-31588
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.4.5
>         Environment: spark:2.4.5
> hdp:2.7
>            Reporter: philipse
>            Priority: Major
>
> Hi,
> Spark SQL now allows us to use REPARTITION or COALESCE hints to manually control small
> files, like the following:
> /*+ REPARTITION(1) */
> /*+ COALESCE(1) */
> But this can only be tuned case by case: each time, we need to decide whether to use
> COALESCE or REPARTITION. Can we try a more general approach that removes this decision
> by setting a target size, as Hive does?
> *Good points:*
> 1)we will also the new partitions number
> 2) With an on/off parameter provided, users can disable it if needed.
> 3) The parameter can be set at the cluster level instead of on the user side, which makes
> it easier to control small files.
> 4) It greatly reduces the pressure on the NameNode.
>  
> *Not good points:*
> 1) It will add a new task to calculate the target number of partitions from statistics on the output files.
>  
> I don't know whether this is already planned for the future.
>  
> Thanks
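
The target-size idea in the quoted description can be sketched roughly as follows. This is only an illustration in plain Python, not an existing Spark setting or API: the function names `target_partition_count` and `merge_hint` are hypothetical, and the rule for choosing between the two hints (COALESCE when the partition count shrinks, REPARTITION otherwise) is an assumption made for this sketch. As the comment above notes, the resulting count does not put a hard limit on the size of any individual file.

```python
def target_partition_count(file_sizes_bytes, target_file_bytes):
    """Hypothetical helper: number of output files needed so that each one
    is roughly target_file_bytes, based on the sizes of the current files."""
    if target_file_bytes <= 0:
        raise ValueError("target_file_bytes must be positive")
    total = sum(file_sizes_bytes)
    # Integer ceiling division; always keep at least one partition.
    return max(1, -(-total // target_file_bytes))


def merge_hint(current_partitions, file_sizes_bytes, target_file_bytes):
    """Hypothetical helper: build the SQL hint text for the computed count.

    Assumption for this sketch: use COALESCE when the count shrinks
    (it avoids a full shuffle) and REPARTITION otherwise.
    """
    n = target_partition_count(file_sizes_bytes, target_file_bytes)
    name = "COALESCE" if n < current_partitions else "REPARTITION"
    return f"/*+ {name}({n}) */"


if __name__ == "__main__":
    mib = 1024 * 1024
    # 200 output files of 1 MiB each, aiming for files of about 128 MiB.
    print(merge_hint(200, [1 * mib] * 200, 128 * mib))
```

For 200 files of 1 MiB and a 128 MiB target, the sketch yields `/*+ COALESCE(2) */`. Gathering the file sizes is the extra statistics step listed under "Not good points".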



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

