spark-dev mailing list archives

From Bruce Robbins <bersprock...@gmail.com>
Subject Re: Dataset schema incompatibility bug when reading column partitioned data
Date Thu, 11 Apr 2019 17:53:32 GMT
I see a Jira:

https://issues.apache.org/jira/browse/SPARK-21021

On Thu, Apr 11, 2019 at 9:08 AM Dávid Szakállas <david.szakallas@gmail.com>
wrote:

> +dev for more visibility. Is this a known issue? Is there a plan for a fix?
>
> Thanks,
> David
>
> Begin forwarded message:
>
> *From: *Dávid Szakállas <david.szakallas@gmail.com>
> *Subject: **Dataset schema incompatibility bug when reading column
> partitioned data*
> *Date: *2019. March 29. 14:15:27 CET
> *To: *user@spark.apache.org
>
> We observed the following bug on Spark 2.4.0:
>
> scala> spark.createDataset(Seq((1,2))).write.partitionBy("_1").parquet("foo.parquet")
>
> scala> val schema = StructType(Seq(StructField("_1", IntegerType), StructField("_2", IntegerType)))
>
> scala> spark.read.schema(schema).parquet("foo.parquet").as[(Int, Int)].show
> +---+---+
> | _2| _1|
> +---+---+
> |  2|  1|
> +---+---+
>
>
> That is, when reading column-partitioned Parquet files, the explicitly
> specified schema is not adhered to; instead, the partitioning columns are
> appended to the end of the column list. This is quite a severe issue, as some
> operations, such as union, fail if the columns are in a different order in the
> two datasets. Thus we have to work around the issue with a select:
>
> val columnNames = schema.fields.map(_.name)
> ds.select(columnNames.head, columnNames.tail: _*)
>
>
> Thanks,
> David Szakallas
> Data Engineer | Whitepages, Inc.
>
>
>

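For anyone who wants to try the workaround from the quoted message end to end, here is a minimal sketch. It assumes a local Spark 2.4.x session with the Parquet data source available. The object name `SchemaOrderWorkaround`, the helper `selectInSchemaOrder`, and the temporary output path are hypothetical names chosen for illustration; they are not part of Spark or of the original report. The helper does the same thing as the `select` in the quoted message, only with `Column` objects instead of names.

```scala
import java.nio.file.Files

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

object SchemaOrderWorkaround {

  // Hypothetical helper (not a Spark API): re-select the columns in the order
  // given by the user-supplied schema. Assumes flat column names without dots.
  def selectInSchemaOrder(df: DataFrame, schema: StructType): DataFrame =
    df.select(schema.fieldNames.map(col): _*)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("schema-order-workaround")
      .getOrCreate()
    import spark.implicits._

    // Write a tiny dataset partitioned by "_1", as in the report.
    // The temp directory is an assumption made so the sketch is self-contained.
    val path = Files.createTempDirectory("spark-partition-order")
      .resolve("foo.parquet").toString
    Seq((1, 2)).toDS().write.partitionBy("_1").parquet(path)

    val schema = StructType(Seq(
      StructField("_1", IntegerType),
      StructField("_2", IntegerType)))

    // On the affected version the report shows "_2" before "_1" here.
    val raw = spark.read.schema(schema).parquet(path)

    // Apply the workaround before converting to a typed Dataset.
    val ds = selectInSchemaOrder(raw, schema).as[(Int, Int)]

    // After the explicit select, the column order matches the schema.
    assert(ds.columns.sameElements(schema.fieldNames))

    spark.stop()
  }
}
```

Doing the reordering before `.as[(Int, Int)]` keeps the positional order of the typed Dataset aligned with the schema, which is what position-based operations such as `union` rely on.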