spark-issues mailing list archives

From "Adam Budde (JIRA)" <j...@apache.org>
Subject [jira] [Closed] (SPARK-19455) Add option for case-insensitive Parquet field resolution
Date Wed, 15 Feb 2017 19:41:41 GMT

     [ https://issues.apache.org/jira/browse/SPARK-19455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Adam Budde closed SPARK-19455.
------------------------------
    Resolution: Duplicate

Closing in favor of https://issues.apache.org/jira/browse/SPARK-19611

> Add option for case-insensitive Parquet field resolution
> --------------------------------------------------------
>
>                 Key: SPARK-19455
>                 URL: https://issues.apache.org/jira/browse/SPARK-19455
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.1.0
>            Reporter: Adam Budde
>
> [SPARK-16980|https://issues.apache.org/jira/browse/SPARK-16980] removed the schema inference
from the HiveMetastoreCatalog class when converting a MetastoreRelation to a LogicalRelation
(HadoopFsRelation, in this case) in favor of simply using the schema returned by the metastore.
This results in an optimization, as the underlying file statuses no longer need to be resolved
until after the partition pruning step, reducing the number of files to be touched significantly
in some cases. The downside is that the data schema used may no longer match the underlying
file schema for case-sensitive formats such as Parquet.
> This change initially included a [patch to ParquetReadSupport|https://github.com/apache/spark/blob/6ce1b675ee9fc9a6034439c3ca00441f9f172f84/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetReadSupport.scala#L270-L284]
that attempted to remedy this conflict by using a case-insensitive fallback mapping when resolving
field names during the schema clipping step. [SPARK-18333|https://issues.apache.org/jira/browse/SPARK-18333]
later removed this patch after [SPARK-17183|https://issues.apache.org/jira/browse/SPARK-17183]
added support for embedding a case-sensitive schema as a Hive Metastore table property. AFAIK
the assumption here was that the data schema obtained from the Metastore table property will
be case sensitive and should match the Parquet schema exactly.
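
For illustration only, a case-insensitive fallback of the kind described above can be sketched as follows. This is not the Spark patch itself: the function names are hypothetical, and returning no match when several file fields differ only by case is an assumption made for this sketch.

```python
from typing import Dict, List, Optional


def resolve_field(requested: str, file_fields: List[str]) -> Optional[str]:
    """Map a requested field name to the name actually stored in the file."""
    # Prefer an exact, case-sensitive match.
    if requested in file_fields:
        return requested
    # Fallback: group the file's field names by their lowercased form.
    by_lower: Dict[str, List[str]] = {}
    for name in file_fields:
        by_lower.setdefault(name.lower(), []).append(name)
    candidates = by_lower.get(requested.lower(), [])
    # Assumption for this sketch: an ambiguous match is treated as no match.
    return candidates[0] if len(candidates) == 1 else None


def clip_schema(requested_fields: List[str], file_fields: List[str]) -> Dict[str, Optional[str]]:
    """Resolve every requested field against the file schema."""
    return {name: resolve_field(name, file_fields) for name in requested_fields}
```

With a lowercased metastore name such as `userid` and a file field `userId`, the exact lookup misses and the fallback resolves to `userId`.
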
> The problem arises when dealing with Parquet-backed tables for which this schema has
not been embedded as a table property and for which the underlying files contain case-sensitive
field names. This will happen for any Hive table that was not created by Spark or that was created
by a Spark version prior to 2.1.0. We've seen Spark SQL return no results for any query containing
a case-sensitive field name for such tables.
> The change we're proposing is to introduce a configuration parameter that will re-enable
case-insensitive field name resolution in ParquetReadSupport. This option will also disable
filter push-down for Parquet, as the filter predicate constructed by Spark SQL contains the
case-insensitive field names, for which Parquet will return 0 records when filtering against
a case-sensitive column name. I was hoping to find a way to construct the filter on-the-fly
in ParquetReadSupport, but Parquet doesn't propagate the Configuration object passed to this
class to the underlying InternalParquetRecordReader class.
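
A toy model of why the option has to disable filter push-down, assuming rows are plain dictionaries and that a pushed-down predicate compares column names exactly. The `scan` function and its `case_insensitive` flag are hypothetical stand-ins, not Spark or Parquet APIs.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def scan(rows: List[Row], column: str, value: Any, case_insensitive: bool = False) -> List[Row]:
    """Filter rows on column == value, modelling the two code paths."""
    if not case_insensitive:
        # Pushed-down path: the column name is compared exactly, so a
        # lowercased name never matches a mixed-case field in the file.
        return [row for row in rows if column in row and row[column] == value]
    # Option enabled: push-down is skipped and the filter is applied after
    # the field name has been resolved case-insensitively.
    result = []
    for row in rows:
        matches = [key for key in row if key.lower() == column.lower()]
        if len(matches) == 1 and row[matches[0]] == value:
            result.append(row)
    return result


rows = [{"userId": 1, "eventTime": 10}, {"userId": 2, "eventTime": 20}]
pushed_down = scan(rows, "userid", 1)
resolved = scan(rows, "userid", 1, case_insensitive=True)
```

In this model `pushed_down` is empty while `resolved` contains the one matching row, which mirrors the zero-record behavior reported above.
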



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

