spark-issues mailing list archives

From "philipse (Jira)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-31588) merge small files may need more common setting
Date Thu, 07 May 2020 15:22:00 GMT

    [ https://issues.apache.org/jira/browse/SPARK-31588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17101775#comment-17101775 ]

philipse commented on SPARK-31588:
----------------------------------

For example:

Suppose a job outputs 3 files of 10M, 50M and 200M, and the block size is 128M. We would
like to keep the file sizes close to the average, but we should also keep the average file
size no smaller than the block size, in case someone sets the wrong parameters.

Case 1: we set the target size to 60M. The expected average file size is
max(block size, 60M) = 128M. The repartition number is an integer file count:
[total_file_size / average_file_size] + 1 = [260M / 128M] + 1 = 3.

The final result will be 3 files of 128M, 128M and 4M.

Case 2: if we set the target size to 5120M, the output is repartitioned into 1 file of 260M.

Thus we can set the target size as a global parameter, and it will benefit all tasks.
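
The arithmetic in the two cases above can be sketched in a few lines of Python. This is
only an illustration of the proposed formula, not existing Spark behaviour: the function
names repartition_count and packed_sizes are hypothetical, and the per-file sizes assume
that output is packed up to the average size, which is what the 128M, 128M, 4M figures imply.

```python
MB = 1024 * 1024


def repartition_count(file_sizes, target_size, block_size=128 * MB):
    """Proposed file count: floor(total / average) + 1.

    The average file size is max(block_size, target_size), so a target
    below the block size is silently raised to the block size.
    """
    average = max(block_size, target_size)
    return sum(file_sizes) // average + 1


def packed_sizes(file_sizes, target_size, block_size=128 * MB):
    """File sizes under the assumption that files are filled up to the average.

    All files but the last hold `average` bytes; the last holds the remainder
    (which is 0 when the total is an exact multiple of the average).
    """
    average = max(block_size, target_size)
    total = sum(file_sizes)
    count = total // average + 1
    sizes = [average] * (count - 1)
    sizes.append(total - average * (count - 1))
    return sizes


outputs = [10 * MB, 50 * MB, 200 * MB]

# Case 1: target 60M is below the 128M block size, so the average becomes 128M.
case1 = [s // MB for s in packed_sizes(outputs, 60 * MB)]

# Case 2: target 5120M exceeds the 260M total, so everything lands in one file.
case2 = [s // MB for s in packed_sizes(outputs, 5120 * MB)]
```

Here case1 evaluates to [128, 128, 4] and case2 to [260], matching the two cases described
above.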

> merge small files may need more common setting
> ----------------------------------------------
>
>                 Key: SPARK-31588
>                 URL: https://issues.apache.org/jira/browse/SPARK-31588
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.4.5
>         Environment: spark:2.4.5
> hdp:2.7
>            Reporter: philipse
>            Priority: Major
>
> Hi,
> Spark SQL now allows us to use REPARTITION or COALESCE hints to manually control small
> files, like the following:
> /*+ REPARTITION(1) */
> /*+ COALESCE(1) */
> But this can only be tuned case by case: we need to decide whether to use COALESCE
> or REPARTITION. Can we try a more common way that avoids this decision by setting a
> target size, as Hive does?
> *Good points:*
> 1) we will also get the new partition number
> 2) with an on/off parameter provided, users can turn it off if needed
> 3) the parameter can be set at the cluster level instead of on the user side, which makes
> it easier to control small files
> 4) it greatly reduces the pressure on the NameNode
>  
> *Not good points:*
> 1) It will add a new task that calculates the target number from statistics on the output files.
>  
> I don't know whether this is already planned for the future.
>  
> Thanks



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

