spark-issues mailing list archives

From "Sean R. Owen (Jira)" <j...@apache.org>
Subject [jira] [Resolved] (SPARK-28853) Support conf to organize filePartitions by file path
Date Sat, 26 Oct 2019 20:55:00 GMT

     [ https://issues.apache.org/jira/browse/SPARK-28853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean R. Owen resolved SPARK-28853.
----------------------------------
    Resolution: Won't Fix

>  Support conf to organize filePartitions by file path
> -----------------------------------------------------
>
>                 Key: SPARK-28853
>                 URL: https://issues.apache.org/jira/browse/SPARK-28853
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: ZhangYao
>            Priority: Major
>
> When dynamically writing data to HDFS, Spark may generate a lot of small files, so sometimes
we need to merge those files. When reading these files and writing them again, it would be helpful
if the RDD partitions of the read were formed according to the partitions on HDFS.
> Currently, in FileSourceScanExec.createNonBucketedReadRDD, after splitting files, Spark
sorts the files by file size, which may scatter the partition distribution of the data files.
It would be a great help to support sorting by file path here :)
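
To illustrate the effect described above, here is a minimal sketch in Python of a simplified packing model. It is not Spark's actual code: SplitFile, pack, directory, and mixed are hypothetical names, the file sizes are made up, and the greedy loop only approximates what createNonBucketedReadRDD is understood to do (close a read partition once the next file would exceed a byte limit). The request was resolved as Won't Fix, so no such conf exists in Spark.

```python
from typing import List, NamedTuple


class SplitFile(NamedTuple):
    """Hypothetical stand-in for one file split to be read."""
    path: str
    length: int


def pack(files: List[SplitFile], max_split_bytes: int, open_cost: int) -> List[List[SplitFile]]:
    """Greedily pack files, in the given order, into read partitions.

    Assumption: a partition is closed as soon as adding the next file
    would push it past max_split_bytes; each file also adds open_cost.
    """
    partitions: List[List[SplitFile]] = []
    current: List[SplitFile] = []
    current_size = 0
    for f in files:
        if current and current_size + f.length > max_split_bytes:
            partitions.append(current)
            current = []
            current_size = 0
        current.append(f)
        current_size += f.length + open_cost
    if current:
        partitions.append(current)
    return partitions


def directory(path: str) -> str:
    """Return the parent directory, i.e. the on-disk partition of a file."""
    return path.rsplit("/", 1)[0]


def mixed(partitions: List[List[SplitFile]]) -> int:
    """Count read partitions that contain files from more than one directory."""
    return sum(1 for p in partitions if len({directory(f.path) for f in p}) > 1)


# Made-up example: two on-disk partitions with two small files each.
files = [
    SplitFile("/t/dt=1/a", 30),
    SplitFile("/t/dt=1/b", 10),
    SplitFile("/t/dt=2/a", 20),
    SplitFile("/t/dt=2/b", 20),
]

# Ordering by size (largest first) versus ordering by path.
by_size = pack(sorted(files, key=lambda f: f.length, reverse=True), 50, 0)
by_path = pack(sorted(files, key=lambda f: f.path), 50, 0)

print("mixed partitions, sorted by size:", mixed(by_size))
print("mixed partitions, sorted by path:", mixed(by_path))
```

In this toy input, ordering by size places files from dt=1 and dt=2 in the same read partition, while ordering by path keeps each read partition within a single directory. Real workloads differ, and ordering by path can give less evenly sized read partitions.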



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

