spark-issues mailing list archives

From "Reynold Xin (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-16575) partition calculation mismatch with sc.binaryFiles
Date Sun, 02 Oct 2016 22:45:20 GMT

    [ https://issues.apache.org/jira/browse/SPARK-16575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15541097#comment-15541097 ]

Reynold Xin commented on SPARK-16575:
-------------------------------------

As responded on the pull request: "I don't actually think this is a bug, because it is intended
to do some coalescing. If there is an issue, the issue would be that we don't take the cost
of individual files into account in this code path. The Spark SQL automatic coalescing code
path does take that into account."
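
To illustrate what "taking the cost of individual files into account" could look like, here is a minimal, Spark-free sketch loosely modeled on the Spark SQL file-packing heuristic. It is a simplified model under stated assumptions, not the actual Spark source: the object and method names are hypothetical, the parameters maxPartitionBytes and openCostInBytes are named after the spark.sql.files.* settings they imitate, and the real code path additionally sorts files and splits large ones, which is omitted here.

```scala
// Hypothetical sketch: pack files into partitions while charging a fixed
// "open cost" per file, so that many tiny files are not all coalesced together.
object OpenCostSplitSketch {

  // Target partition size: bounded above by maxPartitionBytes and below by
  // the per-file open cost, otherwise the padded total divided by parallelism.
  def maxSplitBytes(fileSizes: Seq[Long],
                    maxPartitionBytes: Long,
                    openCostInBytes: Long,
                    parallelism: Int): Long = {
    val totalBytes = fileSizes.map(_ + openCostInBytes).sum
    val bytesPerCore = totalBytes / math.max(parallelism, 1)
    math.min(maxPartitionBytes, math.max(openCostInBytes, bytesPerCore))
  }

  // Greedy packing: close the current partition when the next file would not
  // fit, then account for the file's length plus its open cost.
  def partitionCount(fileSizes: Seq[Long],
                     maxPartitionBytes: Long,
                     openCostInBytes: Long,
                     parallelism: Int): Int = {
    val limit = maxSplitBytes(fileSizes, maxPartitionBytes, openCostInBytes, parallelism)
    var partitions = 0
    var current = 0L
    for (size <- fileSizes) {
      if (current > 0 && current + size > limit) {
        partitions += 1
        current = 0L
      }
      current += size + openCostInBytes
    }
    if (current > 0) partitions + 1 else partitions
  }
}
```

With illustrative inputs of 1922 files of 1000 bytes each, a 128 MB partition cap, a 4 MB open cost, and a parallelism of 8, this model packs 32 files per partition and yields 61 partitions, because each small file is charged as if it were about 4 MB.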

> partition calculation mismatch with sc.binaryFiles
> --------------------------------------------------
>
>                 Key: SPARK-16575
>                 URL: https://issues.apache.org/jira/browse/SPARK-16575
>             Project: Spark
>          Issue Type: Bug
>          Components: Input/Output, Java API, Shuffle, Spark Core, Spark Shell
>    Affects Versions: 1.6.1, 1.6.2
>            Reporter: Suhas
>            Priority: Critical
>
> sc.binaryFiles always creates an RDD with 2 partitions.
> Steps to reproduce (tested on Databricks Community Edition):
> 1. Create an RDD using sc.binaryFiles. In this example, the airlines folder has 1922 files.
>      Ex: val binaryRDD = sc.binaryFiles("/databricks-datasets/airlines/*")
> 2. Check the number of partitions of the above RDD:
>     - binaryRDD.partitions.size = 2 (expected value is more than 2).
> 3. If the RDD is created using sc.textFile, the number of partitions is 1921.
> 4. The same sc.binaryFiles call creates 1921 partitions in Spark 1.5.1.
> For an explanation with a screenshot, see the link below:
> http://apache-spark-developers-list.1001551.n3.nabble.com/Partition-calculation-issue-with-sc-binaryFiles-on-Spark-1-6-2-tt18314.html
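
One plausible reading of the behaviour reported above is size-only coalescing: if the maximum split size is derived as the total input size divided by the requested minimum number of partitions, and that minimum defaults to 2, then any number of small files collapses into 2 partitions. The sketch below models that arithmetic without Spark or Hadoop. It is an assumption-laden simplification (the object and method names are hypothetical, and the greedy packing stands in for the real combining input format, which also considers block locality).

```scala
// Hypothetical sketch of size-only coalescing: no per-file cost is charged,
// so the partition count depends only on total bytes and minPartitions.
object BinaryFilesSplitSketch {

  // Assumed rule: max split size = ceil(total bytes / minPartitions).
  def maxSplitSize(fileSizes: Seq[Long], minPartitions: Int): Long =
    math.ceil(fileSizes.sum / math.max(minPartitions, 1).toDouble).toLong

  // Greedy packing: start a new partition once adding the next file
  // would push the current one past the max split size.
  def packedPartitions(fileSizes: Seq[Long], minPartitions: Int): Int = {
    val limit = maxSplitSize(fileSizes, minPartitions)
    var partitions = 0
    var current = 0L
    for (size <- fileSizes) {
      if (current > 0 && current + size > limit) {
        partitions += 1
        current = 0L
      }
      current += size
    }
    if (current > 0) partitions + 1 else partitions
  }
}
```

For example, BinaryFilesSplitSketch.packedPartitions(Seq.fill(1922)(1000L), 2) evaluates to 2 in this model, while passing 1922 as the second argument gives 1922. If the real code path behaves similarly, passing an explicit minPartitions argument to sc.binaryFiles may raise the partition count, though that is untested here.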


