https://issues.apache.org/jira/browse/SPARK-16575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15541086#comment-15541086
Apache Spark commented on SPARK-16575:
--------------------------------------
User 'fidato13' has created a pull request for this issue:
https://github.com/apache/spark/pull/15327
> partition calculation mismatch with sc.binaryFiles
> --------------------------------------------------
>
> Key: SPARK-16575
> URL: https://issues.apache.org/jira/browse/SPARK-16575
> Project: Spark
> Issue Type: Bug
> Components: Input/Output, Java API, Shuffle, Spark Core, Spark Shell
> Affects Versions: 1.6.1, 1.6.2
> Reporter: Suhas
> Priority: Critical
>
> sc.binaryFiles always creates an RDD with only 2 partitions.
> Steps to reproduce (tested on Databricks Community Edition):
> 1. Create an RDD using sc.binaryFiles. In this example, the airlines folder has 1922 files.
> Ex: val binaryRDD = sc.binaryFiles("/databricks-datasets/airlines/*")
> 2. Check the number of partitions of the above RDD:
> - binaryRDD.partitions.size = 2 (the expected value is far greater than 2).
> 3. If the RDD is created with sc.textFile instead, the number of partitions is 1921.
> 4. The same sc.binaryFiles call creates 1921 partitions on Spark 1.5.1.
> For an explanation with screenshots, see the thread below:
> http://apache-spark-developers-list.1001551.n3.nabble.com/Partition-calculation-issue-with-sc-binaryFiles-on-Spark-1-6-2-tt18314.html
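A minimal spark-shell sketch of the reproduction described above. It assumes the same Databricks airlines dataset path and file count given in the report; adjust the path for your own environment.

  // Reproduction sketch for SPARK-16575 (spark-shell, Spark 1.6.x)
  val path = "/databricks-datasets/airlines/*"

  // On Spark 1.6.1/1.6.2 this reports 2 partitions regardless of file count.
  val binaryRDD = sc.binaryFiles(path)
  println(s"binaryFiles partitions: ${binaryRDD.partitions.size}")

  // For comparison, textFile on the same path yields 1921 partitions,
  // as does binaryFiles on Spark 1.5.1.
  val textRDD = sc.textFile(path)
  println(s"textFile partitions: ${textRDD.partitions.size}")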