spark-issues mailing list archives

From "Reynold Xin (JIRA)" <>
Subject [jira] [Resolved] (SPARK-16980) Load only catalog table partition metadata required to answer a query
Date Sat, 15 Oct 2016 01:26:20 GMT


Reynold Xin resolved SPARK-16980.
       Resolution: Fixed
    Fix Version/s: 2.1.0

> Load only catalog table partition metadata required to answer a query
> ---------------------------------------------------------------------
>                 Key: SPARK-16980
>                 URL:
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Michael Allman
>            Assignee: Michael Allman
>             Fix For: 2.1.0
> Currently, when a user reads from a partitioned Hive table whose metadata are not cached
> (and for which Hive table conversion is enabled and supported), all partition metadata are
> fetched from the metastore.
> However, if the user's query includes partition-pruning predicates, then we only need
> the subset of these metadata that satisfies those predicates.
> This issue tracks work to modify the current query planning scheme so that unnecessary
> partition metadata are not loaded.
> I've prototyped two possible approaches. The first extends {{o.a.s.s.c.catalog.ExternalCatalog}}
> and as such is more generally applicable. It requires some new abstractions and refactoring
> of {{HadoopFsRelation}} and {{FileCatalog}}, among others. It places a greater burden on other
> implementations of {{ExternalCatalog}}. Currently the only other implementation of {{ExternalCatalog}}
> is {{InMemoryCatalog}}, and my prototype throws an {{UnsupportedOperationException}} in that implementation.
> The second prototype is simpler and only touches code in the {{hive}} project. Basically,
> conversion of a partitioned {{MetastoreRelation}} to a {{HadoopFsRelation}} is deferred to
> physical planning. During physical planning, the partition-pruning filters in the query plan
> are used to identify the required partition metadata, and a {{HadoopFsRelation}} is built from
> those. The new query plan is then re-injected into the physical planner and proceeds as normal
> for a {{HadoopFsRelation}}.
> On the Spark dev mailing list, [~ekhliang] expressed a preference for the approach I
> took in my first POC. (See the linked thread.) Based on that, I'm going to open a PR with
> that patch as a starting point for an architectural/design review. It will not be a complete
> patch ready for integration into Spark master. Rather, I would like to get early feedback
> on the implementation details so I can shape the PR before committing a large amount of
> time to a finished product. I will open another PR for the second approach for comparison
> if requested.
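The first approach above can be sketched in miniature: the catalog interface gains a predicate-aware partition listing so that only matching partition metadata is ever fetched. This is a toy model, not Spark's actual {{ExternalCatalog}} API; all names below ({{ExternalCatalogSketch}}, {{listPartitionsByFilter}}, etc.) are illustrative.

```scala
// Toy model of approach 1: push partition pruning into the catalog itself.
// Names and signatures are hypothetical, not Spark's real API.
case class CatalogTablePartition(spec: Map[String, String], location: String)

trait ExternalCatalogSketch {
  // Existing behavior: fetch metadata for every partition of the table.
  def listPartitions(table: String): Seq[CatalogTablePartition]

  // New behavior: the catalog filters partitions against the pruning
  // predicate, so unneeded metadata is never loaded by the caller.
  def listPartitionsByFilter(
      table: String,
      predicate: Map[String, String] => Boolean): Seq[CatalogTablePartition]
}

// A trivial in-memory implementation; a metastore-backed one would
// translate the predicate into a metastore-side partition filter.
class InMemoryCatalogSketch(data: Map[String, Seq[CatalogTablePartition]])
    extends ExternalCatalogSketch {
  def listPartitions(table: String): Seq[CatalogTablePartition] =
    data.getOrElse(table, Nil)

  def listPartitionsByFilter(
      table: String,
      predicate: Map[String, String] => Boolean): Seq[CatalogTablePartition] =
    listPartitions(table).filter(p => predicate(p.spec))
}
```

As the description notes, implementations unable to filter catalog-side could simply throw {{UnsupportedOperationException}} from the new method.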
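The second approach can likewise be sketched: conversion to the file-based relation is deferred until physical planning, when the plan's pruning filters are known, so only matching partitions' files are listed. Again a toy model under assumed names; in the real prototype the partitions would be fetched lazily from the metastore rather than held in memory.

```scala
// Toy model of approach 2: defer building the file-based relation until
// physical planning. All types and names here are illustrative.
case class Partition(spec: Map[String, String], files: Seq[String])
case class MetastoreRelationSketch(allPartitions: Seq[Partition])
case class HadoopFsRelationSketch(files: Seq[String])

// At physical-planning time the pruning predicates extracted from the
// query plan are applied first; only the surviving partitions contribute
// files to the materialized relation.
def planPartitionedScan(
    rel: MetastoreRelationSketch,
    pruningPredicates: Seq[Map[String, String] => Boolean]): HadoopFsRelationSketch = {
  val pruned = rel.allPartitions.filter(p => pruningPredicates.forall(_(p.spec)))
  HadoopFsRelationSketch(pruned.flatMap(_.files))
}
```

The resulting relation then re-enters planning as an ordinary file-based scan, which is the "re-injected into the physical planner" step in the description.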

This message was sent by Atlassian JIRA

