spark-issues mailing list archives

From "Adam Budde (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-19611) Spark 2.1.0 breaks some Hive tables backed by case-sensitive data files
Date Thu, 16 Feb 2017 16:45:42 GMT

    [ https://issues.apache.org/jira/browse/SPARK-19611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15870262#comment-15870262
] 

Adam Budde commented on SPARK-19611:
------------------------------------

[SPARK-17183|https://issues.apache.org/jira/browse/SPARK-17183] added support for saving the
case-sensitive schema in the table properties in order to avoid the conflicts introduced by
Hive metastore downcasing without the need for schema inference. The problem stated here occurs
when Spark doesn't find a case-sensitive schema in the table properties and falls back to
the case-insensitive metastore schema. This will happen for any Hive table that wasn't created
by Spark or that was created prior to Spark 2.1.0.
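
To make the fallback concrete, here is a minimal sketch of the lookup order described above. It is plain Python, not Spark's actual code: the property key SCHEMA_PROP, the comma-separated encoding and the function name are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List

# Hypothetical property key and encoding; the real key Spark writes is not
# stated in this thread.
SCHEMA_PROP = "case.sensitive.schema"


def resolve_schema(table_props: Dict[str, str], metastore_columns: List[str]) -> List[str]:
    """Prefer the case-sensitive schema saved in the table properties;
    otherwise fall back to the (lower-cased) metastore column names."""
    saved = table_props.get(SCHEMA_PROP)
    if saved:
        return saved.split(",")
    # Fallback path: this is where the original field-name casing is lost.
    return list(metastore_columns)
```

With the property present, resolve_schema returns names such as userId unchanged. With an empty property map it returns whatever the metastore holds, e.g. userid.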

In the [PR 16797 discussion|https://github.com/apache/spark/pull/16797] I explained why I
think that simply offering a way to perform migrations that write case-sensitive schemas
to the table properties won't be sufficient on its own to solve this.

> Spark 2.1.0 breaks some Hive tables backed by case-sensitive data files
> -----------------------------------------------------------------------
>
>                 Key: SPARK-19611
>                 URL: https://issues.apache.org/jira/browse/SPARK-19611
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.1.0
>            Reporter: Adam Budde
>
> This issue replaces [SPARK-19455|https://issues.apache.org/jira/browse/SPARK-19455] and
[PR #16797|https://github.com/apache/spark/pull/16797]
> [SPARK-16980|https://issues.apache.org/jira/browse/SPARK-16980] removed the schema inference
from the HiveMetastoreCatalog class when converting a MetastoreRelation to a LogicalRelation
(HadoopFsRelation, in this case) in favor of simply using the schema returned by the metastore.
This results in an optimization as the underlying file statuses no longer need to be resolved
until after the partition pruning step, reducing the number of files to be touched significantly
in some cases. The downside is that the data schema used may no longer match the underlying
file schema for case-sensitive formats such as Parquet.
> [SPARK-17183|https://issues.apache.org/jira/browse/SPARK-17183] added support for saving
a case-sensitive copy of the schema in the metastore table properties, which HiveExternalCatalog
will read in as the table's schema if it is present. If it is not present, it will fall back
to the case-insensitive metastore schema.
> Unfortunately, this silently breaks queries over tables where the underlying data fields
are case-sensitive but a case-sensitive schema wasn't written to the table properties by Spark.
This situation will occur for any Hive table that wasn't created by Spark or that was created
prior to Spark 2.1.0. If a user attempts to run a query over such a table containing a case-sensitive
field name in the query projection or in the query filter, the query will return 0 results
in every case.
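
A toy model of the failure mode described above, assuming a reader that matches field names exactly and yields a null for any name it cannot find in the file. The helper names and sample records are made up for illustration, and this does not exercise Spark or Parquet itself.

```python
def read_rows(file_rows, schema):
    """Project each file record onto `schema` using exact (case-sensitive)
    name matching; a name missing from the record yields None."""
    return [{name: row.get(name) for name in schema} for row in file_rows]


def filter_eq(rows, column, value):
    """Keep only the rows whose `column` equals `value`."""
    return [r for r in rows if r[column] == value]


# Records as stored in the data files, with mixed-case field names.
file_rows = [{"userId": 1}, {"userId": 2}]

# Case-sensitive schema: the filter finds the matching row.
good = filter_eq(read_rows(file_rows, ["userId"]), "userId", 1)

# Lower-cased metastore schema: every value reads as None, so nothing matches.
bad = filter_eq(read_rows(file_rows, ["userid"]), "userid", 1)
```

In this model no error is raised on the second query. It simply returns an empty result, which is why the breakage is silent.
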
> The change we are proposing is to bring back the schema inference that was used prior
to Spark 2.1.0 if a case-sensitive schema can't be read from the table properties. The
proposed modes are:
> - INFER_AND_SAVE: Infer a schema from the data files if no case-sensitive schema can
be read from the table properties. Attempt to save the inferred schema in the table properties
to avoid future inference.
> - INFER_ONLY: Infer the schema if no case-sensitive schema can be read but don't attempt
to save it.
> - NEVER_INFER: Fall back to using the case-insensitive schema returned by the Hive Metastore.
Useful if the user knows that none of the underlying data is case-sensitive.
> See the discussion on [PR #16797|https://github.com/apache/spark/pull/16797] for more
detail on this issue and the proposed solution.
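
The three modes quoted above could be dispatched roughly as follows. This is a sketch only: the infer and save callables, the property key and the choice to swallow a failed save are assumptions made for illustration, not the implementation in the PR.

```python
from enum import Enum
from typing import Callable, Dict, List

# Hypothetical property key and encoding, used only in this sketch.
SCHEMA_PROP = "case.sensitive.schema"


class InferenceMode(Enum):
    INFER_AND_SAVE = "INFER_AND_SAVE"
    INFER_ONLY = "INFER_ONLY"
    NEVER_INFER = "NEVER_INFER"


def schema_for_table(
    mode: InferenceMode,
    table_props: Dict[str, str],
    metastore_columns: List[str],
    infer: Callable[[], List[str]],
    save: Callable[[List[str]], None],
) -> List[str]:
    """Pick the schema for a table according to the inference mode."""
    saved = table_props.get(SCHEMA_PROP)
    if saved:
        # A case-sensitive schema is already stored: no inference needed.
        return saved.split(",")
    if mode is InferenceMode.NEVER_INFER:
        # Use the case-insensitive metastore schema as-is.
        return list(metastore_columns)
    inferred = infer()  # e.g. read the schema from the data files
    if mode is InferenceMode.INFER_AND_SAVE:
        try:
            save(inferred)  # avoid repeating the inference next time
        except Exception:
            # Assumption: a failed save is tolerated because the description
            # only says to "attempt" it.
            pass
    return inferred
```

Under these assumptions, INFER_AND_SAVE pays the inference cost once per table, INFER_ONLY pays it on every load, and NEVER_INFER never touches the data files.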



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

