spark-issues mailing list archives

From "Bruce Robbins (JIRA)" <j...@apache.org>
Subject [jira] [Created] (SPARK-26990) Difference in handling of mixed-case partition columns after SPARK-26188
Date Tue, 26 Feb 2019 00:59:00 GMT
Bruce Robbins created SPARK-26990:
-------------------------------------

             Summary: Difference in handling of mixed-case partition columns after SPARK-26188
                 Key: SPARK-26990
                 URL: https://issues.apache.org/jira/browse/SPARK-26990
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.4.1
            Reporter: Bruce Robbins


I noticed that the [PR for SPARK-26188|https://github.com/apache/spark/pull/23165] changed
how mixed-case partition columns are handled when the user provides a schema.

Say I have this file structure (note that each instance of `pS` is mixed case):
{noformat}
bash-3.2$ find partitioned5 -type d
partitioned5
partitioned5/pi=2
partitioned5/pi=2/pS=foo
partitioned5/pi=2/pS=bar
partitioned5/pi=1
partitioned5/pi=1/pS=foo
partitioned5/pi=1/pS=bar
bash-3.2$
{noformat}
If I load this directory with a user-provided schema in 2.4 (before the PR was committed) or 2.3,
I see:
{noformat}
scala> val df = spark.read.schema("intField int, pi int, ps string").parquet("partitioned5")
df: org.apache.spark.sql.DataFrame = [intField: int, pi: int ... 1 more field]
scala> df.printSchema
root
 |-- intField: integer (nullable = true)
 |-- pi: integer (nullable = true)
 |-- ps: string (nullable = true)
scala>
{noformat}
However, using 2.4 after the PR was committed, I see:
{noformat}
scala> val df = spark.read.schema("intField int, pi int, ps string").parquet("partitioned5")
df: org.apache.spark.sql.DataFrame = [intField: int, pi: int ... 1 more field]
scala> df.printSchema
root
 |-- intField: integer (nullable = true)
 |-- pi: integer (nullable = true)
 |-- pS: string (nullable = true)
scala>
{noformat}
Spark is picking up the mixed-case column name {{pS}} from the directory name, not the lower-case
{{ps}} from my specified schema.

In all tests, {{spark.sql.caseSensitive}} is set to the default (false).
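
To make the difference concrete, here is a minimal plain-Scala sketch of the two outcomes. It is not Spark's actual resolution code: the object {{PartitionNameResolution}}, the method {{resolve}}, and the flag {{preferUserSchema}} are hypothetical names used only for illustration. It assumes case-insensitive matching, as with the default {{spark.sql.caseSensitive=false}}.

```scala
import java.util.Locale

// Illustrative only: NOT Spark's implementation. All names are hypothetical.
object PartitionNameResolution {

  /** Decide which spelling is reported for each field of the user schema.
    *
    * Fields are matched against directory-derived partition column names
    * case-insensitively. With preferUserSchema = true the schema spelling
    * wins (the output observed before the PR); with false the directory
    * spelling wins (the output observed after the PR).
    */
  def resolve(
      userFields: Seq[String],
      dirPartitionCols: Seq[String],
      preferUserSchema: Boolean): Seq[String] = {
    // Index the directory-derived names by their lower-cased form.
    val dirNameByLower =
      dirPartitionCols.map(c => c.toLowerCase(Locale.ROOT) -> c).toMap

    userFields.map { field =>
      dirNameByLower.get(field.toLowerCase(Locale.ROOT)) match {
        case Some(dirName) if !preferUserSchema => dirName // directory spelling
        case _                                  => field   // schema spelling
      }
    }
  }

  def main(args: Array[String]): Unit = {
    val userFields = Seq("intField", "pi", "ps") // from the user-provided schema
    val dirCols    = Seq("pi", "pS")             // from the pi=…/pS=… directories

    println(resolve(userFields, dirCols, preferUserSchema = true))
    println(resolve(userFields, dirCols, preferUserSchema = false))
  }
}
```

With the schema and directory names from this report, the first call yields {{intField, pi, ps}} and the second yields {{intField, pi, pS}}, matching the two {{printSchema}} outputs above.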

Not sure if this is a bug, but it is a difference.


