spark-issues mailing list archives

From "Reynold Xin (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (SPARK-16980) Load only catalog table partition metadata required to answer a query
Date Tue, 11 Oct 2016 01:05:20 GMT

     [ https://issues.apache.org/jira/browse/SPARK-16980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin updated SPARK-16980:
--------------------------------
    Issue Type: Sub-task  (was: Improvement)
        Parent: SPARK-17861

> Load only catalog table partition metadata required to answer a query
> ---------------------------------------------------------------------
>
>                 Key: SPARK-16980
>                 URL: https://issues.apache.org/jira/browse/SPARK-16980
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Michael Allman
>            Assignee: Michael Allman
>
> Currently, when a user reads from a partitioned Hive table whose metadata are not cached (and for which Hive table conversion is enabled and supported), all partition metadata are fetched from the metastore:
> https://github.com/apache/spark/blob/5effc016c893ce917d535cc1b5026d8e4c846721/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L252-L260
> However, if the user's query includes partition pruning predicates, then we need only the subset of this metadata that satisfies those predicates.
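> For illustration (the table and column names here are hypothetical, not from my prototype), a query like the following should require listing only the partitions whose {{ds}} value matches the predicate, not every partition of the table:
> {code:scala}
> // Assumes a `spark` SparkSession (e.g. in spark-shell) and a Hive table
> // `logs` partitioned by a string column `ds`. Planning this query should
> // only need the metadata for partitions satisfying ds = '2016-08-01'.
> val df = spark.table("logs").where("ds = '2016-08-01'").select("event")
> df.explain(true)
> {code}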
> This issue tracks work to modify the current query planning scheme so that unnecessary partition metadata are not loaded.
> I've prototyped two possible approaches. The first extends {{o.a.s.s.c.catalog.ExternalCatalog}} and as such is more generally applicable. It requires some new abstractions and refactoring of {{HadoopFsRelation}} and {{FileCatalog}}, among others. It places a greater burden on other implementations of {{ExternalCatalog}}. Currently the only other implementation of {{ExternalCatalog}} is {{InMemoryCatalog}}, and my prototype throws an {{UnsupportedOperationException}} on that implementation.
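> As a rough sketch of the kind of extension this approach implies (the method name and signature below are illustrative, not my prototype's actual API), the catalog interface would gain a way to list only the partitions matching a set of pruning predicates:
> {code:scala}
> import org.apache.spark.sql.catalyst.catalog.{CatalogTablePartition, ExternalCatalog}
> import org.apache.spark.sql.catalyst.expressions.Expression
>
> // Illustrative only: a catalog-level hook for partition pruning. Each
> // ExternalCatalog implementation would translate the predicates into
> // whatever filtering its backing store supports and return only the
> // matching partitions; InMemoryCatalog could initially just throw
> // UnsupportedOperationException, as my prototype does.
> trait PrunedPartitionSupport { self: ExternalCatalog =>
>   def listPartitionsByFilter(
>       db: String,
>       table: String,
>       predicates: Seq[Expression]): Seq[CatalogTablePartition]
> }
> {code}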
> The second prototype is simpler and only touches code in the {{hive}} project. Basically, conversion of a partitioned {{MetastoreRelation}} to a {{HadoopFsRelation}} is deferred to physical planning. During physical planning, the partition pruning filters in the query plan are used to identify the required partition metadata, and a {{HadoopFsRelation}} is built from those. The new query plan is then re-injected into the physical planner and proceeds as normal for a {{HadoopFsRelation}}.
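> A key step in that approach is separating the filters that reference only partition columns (and so can be used to identify the required partitions) from the rest. A minimal, self-contained sketch of that split using Catalyst expressions directly (the helper name is illustrative, not from my prototype):
> {code:scala}
> import org.apache.spark.sql.catalyst.expressions.{Attribute, AttributeSet, Expression}
>
> // Split the query's filters into partition pruning predicates (those whose
> // references are a non-empty subset of the partition columns) and the rest.
> // Only the first group can be used to decide which partition metadata to
> // load before building the HadoopFsRelation.
> def splitPruningFilters(
>     filters: Seq[Expression],
>     partitionColumns: Seq[Attribute]): (Seq[Expression], Seq[Expression]) = {
>   val partitionSet = AttributeSet(partitionColumns)
>   filters.partition(f => f.references.nonEmpty && f.references.subsetOf(partitionSet))
> }
> {code}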
> On the Spark dev mailing list, [~ekhliang] expressed a preference for the approach I took in my first POC. (See http://apache-spark-developers-list.1001551.n3.nabble.com/Scaling-partitioned-Hive-table-support-td18586.html) Based on that, I'm going to open a PR with that patch as a starting point for an architectural/design review. It will not be a complete patch ready for integration into Spark master. Rather, I would like to get early feedback on the implementation details so I can shape the PR before committing a large amount of time to a finished product. I will open another PR for the second approach for comparison if requested.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

