spark-issues mailing list archives

From "Reynold Xin (JIRA)" <>
Subject [jira] [Commented] (SPARK-16575) partition calculation mismatch with sc.binaryFiles
Date Sun, 02 Oct 2016 22:53:20 GMT


Reynold Xin commented on SPARK-16575:

The old behavior doesn't make sense either, because it risks creating tons of partitions in
practice when there are a lot of small files. Again, I think the appropriate fix is to take
into account that there is a cost to processing each file (i.e. a file whose size is zero should
not be treated as "free"), similar to what Spark SQL does with the setting "spark.sql.files.openCostInBytes".
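As an illustration of that idea, here is a minimal sketch in plain Scala (not Spark code) of how a per-file open cost can feed into split sizing, modeled loosely on the maxSplitBytes computation Spark SQL performs; the config defaults and the file sizes are assumptions for the example, not values taken from this issue:

{noformat}
object OpenCostSketch {
  // Compute a target split size roughly the way Spark SQL does: every file
  // is charged openCostInBytes on top of its real size, so a zero-byte file
  // still contributes to the total and is never treated as "free".
  def maxSplitBytes(
      fileSizes: Seq[Long],
      maxPartitionBytes: Long = 128L * 1024 * 1024, // spark.sql.files.maxPartitionBytes default
      openCostInBytes: Long = 4L * 1024 * 1024,     // spark.sql.files.openCostInBytes default
      defaultParallelism: Int = 8): Long = {
    val totalBytes = fileSizes.map(_ + openCostInBytes).sum
    val bytesPerCore = totalBytes / defaultParallelism
    math.min(maxPartitionBytes, math.max(openCostInBytes, bytesPerCore))
  }

  def main(args: Array[String]): Unit = {
    val openCost = 4L * 1024 * 1024
    // 1922 small files of ~100 KB each, as in the report quoted below.
    val sizes = Seq.fill(1922)(100L * 1024)
    val split = maxSplitBytes(sizes, openCostInBytes = openCost)
    val effectiveTotal = sizes.map(_ + openCost).sum
    val estPartitions = math.ceil(effectiveTotal.toDouble / split).toInt
    // Packing files into splits of this size gives a few dozen partitions:
    // neither 2 (all small files collapsed together) nor 1922 (one per file).
    println(s"split = $split bytes, ~$estPartitions partitions")
  }
}
{noformat}

Because the open cost puts a floor under each file's contribution, many tiny files still spread across a reasonable number of partitions, while the maxPartitionBytes cap keeps a large total from collapsing into just two.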

> partition calculation mismatch with sc.binaryFiles
> --------------------------------------------------
>                 Key: SPARK-16575
>                 URL:
>             Project: Spark
>          Issue Type: Bug
>          Components: Input/Output, Java API, Shuffle, Spark Core, Spark Shell
>    Affects Versions: 1.6.1, 1.6.2
>            Reporter: Suhas
>            Priority: Critical
> sc.binaryFiles always creates an RDD with only 2 partitions.
> Steps to reproduce: (Tested this bug on databricks community edition)
> 1. Try to create an RDD using sc.binaryFiles. In this example, the airlines folder has 1922 files.
>      Ex: {noformat}val binaryRDD = sc.binaryFiles("/databricks-datasets/airlines/*"){noformat}
> 2. Check the number of partitions of the above RDD:
>     - binaryRDD.partitions.size = 2 (the expected value is more than 2).
> 3. If the RDD is created using sc.textFile, the number of partitions is 1921.
> 4. The same sc.binaryFiles call creates 1921 partitions on Spark 1.5.1.
> For an explanation with a screenshot, please see the link below:
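The steps above amount to the following spark-shell sketch; the dataset path comes from the report, and the partition counts in the comments are the values reported in this issue (Databricks community edition), not freshly measured:

{noformat}
// Spark 1.6.x shell; sc is the SparkContext provided by the shell.
val binaryRDD = sc.binaryFiles("/databricks-datasets/airlines/*")
binaryRDD.partitions.size // 2 on 1.6.1/1.6.2, although more than 2 is expected

val textRDD = sc.textFile("/databricks-datasets/airlines/*")
textRDD.partitions.size   // 1921

// On Spark 1.5.1, the same sc.binaryFiles call yields 1921 partitions.
{noformat}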
