spark-issues mailing list archives

From "Reynold Xin (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-16575) partition calculation mismatch with sc.binaryFiles
Date Sun, 02 Oct 2016 22:53:20 GMT

    [ https://issues.apache.org/jira/browse/SPARK-16575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15541108#comment-15541108 ]

Reynold Xin commented on SPARK-16575:
-------------------------------------

The old behavior doesn't make sense either, because in practice it risks creating a huge
number of partitions when there are many small files. Again, I think the appropriate fix is
to take into account that there is a cost to processing each file (i.e. a file whose size is
zero should not be treated as "free"), similar to what Spark SQL does with the setting
"spark.sql.files.openCostInBytes".
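A sketch of that cost model (illustrative Scala, not Spark's actual code: the object and method names here are hypothetical, and the 128 MB / 4 MB constants mirror the defaults of "spark.sql.files.maxPartitionBytes" and "spark.sql.files.openCostInBytes"). Billing every file a fixed open cost on top of its size means even zero-byte files contribute to the split-size calculation, so many small files no longer collapse into a handful of partitions:

```scala
// Illustrative sketch of an open-cost partitioning model, modeled on what
// Spark SQL does with spark.sql.files.openCostInBytes. Not Spark's real code.
object OpenCostSketch {
  val maxPartitionBytes = 128L * 1024 * 1024 // cap on a partition's billed size
  val openCostInBytes   = 4L * 1024 * 1024   // fixed cost charged per file

  /** Target split size: each file is billed its size plus a fixed open cost. */
  def maxSplitBytes(fileSizes: Seq[Long], parallelism: Int): Long = {
    val totalBilled  = fileSizes.map(_ + openCostInBytes).sum
    val bytesPerCore = totalBilled / parallelism
    math.min(maxPartitionBytes, math.max(openCostInBytes, bytesPerCore))
  }

  /** Greedily pack files into partitions of at most `splitSize` billed bytes. */
  def numPartitions(fileSizes: Seq[Long], parallelism: Int): Int = {
    val splitSize  = maxSplitBytes(fileSizes, parallelism)
    var partitions = 0
    var current    = 0L
    for (size <- fileSizes) {
      val billed = size + openCostInBytes
      if (current + billed > splitSize && current > 0L) {
        partitions += 1 // close the current partition and start a new one
        current = 0L
      }
      current += billed
    }
    if (current > 0L) partitions += 1
    partitions
  }
}
```

Under this sketch, 1922 zero-byte files with a default parallelism of 2 are billed 1922 × 4 MB in total, so the greedy packing produces 61 partitions of up to 32 files each, rather than 2.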


> partition calculation mismatch with sc.binaryFiles
> --------------------------------------------------
>
>                 Key: SPARK-16575
>                 URL: https://issues.apache.org/jira/browse/SPARK-16575
>             Project: Spark
>          Issue Type: Bug
>          Components: Input/Output, Java API, Shuffle, Spark Core, Spark Shell
>    Affects Versions: 1.6.1, 1.6.2
>            Reporter: Suhas
>            Priority: Critical
>
> sc.binaryFiles always creates an RDD with 2 partitions.
> Steps to reproduce (tested on Databricks Community Edition):
> 1. Create an RDD using sc.binaryFiles. In this example, the airlines folder has 1922 files.
>      Ex: {noformat}val binaryRDD = sc.binaryFiles("/databricks-datasets/airlines/*"){noformat}
> 2. check the number of partitions of the above RDD
>     - binaryRDD.partitions.size = 2. (expected value is more than 2)
> 3. If the RDD is created using sc.textFile, the number of partitions is 1921.
> 4. The same sc.binaryFiles call creates 1921 partitions in Spark 1.5.1.
> For an explanation with screenshots, see the link below:
> http://apache-spark-developers-list.1001551.n3.nabble.com/Partition-calculation-issue-with-sc-binaryFiles-on-Spark-1-6-2-tt18314.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

