spark-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "I Gia (JIRA)" <j...@apache.org>
Subject [jira] [Created] (SPARK-19629) Partitioning of Parquet is not considered correctly at loading in local[X] mode
Date Thu, 16 Feb 2017 12:38:41 GMT
I Gia created SPARK-19629:
-----------------------------

             Summary: Partitioning of Parquet is not considered correctly at loading in local[X]
mode
                 Key: SPARK-19629
                 URL: https://issues.apache.org/jira/browse/SPARK-19629
             Project: Spark
          Issue Type: Bug
          Components: Input/Output, Spark Core
    Affects Versions: 2.1.0, 2.0.0
         Environment: Tested using docker run gettyimages/spark:1.6.1-hadoop-2.6 and
docker run gettyimages/spark:2.1.0-hadoop-2.7.
            Reporter: I Gia
            Priority: Minor


Running the following two examples will lead to different results depending on whether the
code is run using Spark 1.6 or Spark 2.1. 

What does the example do?
===========================
-The code creates an exemplary dataframe with random data. 
- The dataframe is repartitioned and stored to disk. 
- Then the dataframe is re-read from disk.
- The number of partitions of the dataframe is considered.

What is the expected behaviour?
===========================
The number of partitions specified when storing the dataframe should be the same as when re-loading
the dataframe from disk.

On Spark 1.6 the partitioning is kept, i.e., the code example will return 10 partitions as
specified using npartitions; on Spark 2.1 the number of partitions will equal the number of
local nodes specified when starting Spark (using local[X] as master).

Minimal code example
===========================
# run on Spark 1.6
import scala.util.Random
import org.apache.spark.sql.types.{StructField, StructType, FloatType}
import org.apache.spark.sql.Row
 val rdd = sc.parallelize(Seq.fill(100)(Row(Seq(Random.nextFloat()): _*)))
val df = sqlContext.createDataFrame(rdd, StructType(Seq(StructField("test", FloatType))))
val npartitions = 10
df.repartition(npartitions).write.parquet("/tmp/test1")
val read = sqlContext.read.parquet("/tmp/test1")
assert(npartitions == read.rdd.getNumPartitions) //true on Spark 1.6


# run on Spark 2.1
import scala.util.Random
import org.apache.spark.sql.types.{StructField, StructType, FloatType}
import org.apache.spark.sql.Row
val rdd = sc.parallelize(Seq.fill(100)(Row(Seq(Random.nextFloat()): _*)))
val df = spark.sqlContext.createDataFrame(rdd, StructType(Seq(StructField("test", FloatType))))
val npartitions = 10
df.repartition(npartitions).write.parquet("/tmp/test1")
val read = spark.sqlContext.read.parquet("/tmp/test1")
assert(npartitions == read.rdd.getNumPartitions) //false on Spark 2.1



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org


Mime
View raw message