spark-dev mailing list archives

From Ryan Blue <rb...@netflix.com.INVALID>
Subject Re: SPIP: Catalog API for view metadata
Date Wed, 19 Aug 2020 23:42:38 GMT
I think it is a good idea to keep tables and views separate.

The two main arguments I’ve heard for combining lookup into a single
function are the ones brought up in this thread. First, an identifier in a
catalog must be either a view or a table and should not collide. Second, a
single lookup is more likely to require a single RPC. I think the RPC
concern is well addressed by caching, which we already do in the Spark
catalog, so I’ll primarily focus on the first.
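
To illustrate the caching point, here is a minimal sketch, assuming a metastore that keeps tables and views in one namespace. Every name in it (CountingMetastore, CachingCatalog, Entry) is hypothetical and this is not how the Spark catalog cache is implemented; it only shows that a view lookup followed by a table lookup can share one round trip.

```scala
import scala.collection.mutable

// Hypothetical names throughout; a toy model, not Spark's catalog cache.
object CachedLookup {
  final case class Identifier(namespace: Seq[String], name: String)

  sealed trait Entry
  final case class TableEntry(location: String) extends Entry
  final case class ViewEntry(sql: String) extends Entry

  // Stand-in for a remote metastore; counts round trips.
  final class CountingMetastore(entries: Map[Identifier, Entry]) {
    var rpcCount: Int = 0
    def fetch(ident: Identifier): Option[Entry] = {
      rpcCount += 1
      entries.get(ident)
    }
  }

  // Caches the raw metastore answer (hits and misses) per identifier,
  // so loadView and loadTable share a single fetch.
  final class CachingCatalog(metastore: CountingMetastore) {
    private val cache = mutable.Map.empty[Identifier, Option[Entry]]

    private def lookup(ident: Identifier): Option[Entry] =
      cache.getOrElseUpdate(ident, metastore.fetch(ident))

    def loadView(ident: Identifier): Option[ViewEntry] =
      lookup(ident).collect { case v: ViewEntry => v }

    def loadTable(ident: Identifier): Option[TableEntry] =
      lookup(ident).collect { case t: TableEntry => t }
  }
}
```

The sketch leaves out invalidation and expiry, which a real cache would need.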

Table/view name collision is unlikely to be a problem. Metastores that
support both today store them in a single namespace, so this is not a
concern even for a naive implementation that talks to the Hive MetaStore. I
know that a new metastore catalog could choose to implement both
ViewCatalog and TableCatalog and store the two sets separately, but that
would be a very strange choice: if the metastore itself has different
namespaces for tables and views, then it makes much more sense to expose
them through separate catalogs because Spark will always prefer one over
the other.

Along similar lines, catalogs that expose both views and tables are much
rarer than catalogs that expose only one. For example, v2 catalogs for JDBC
and Cassandra expose data through the Table interface, and implementing
ViewCatalog would make little sense for them. Exposing new data sources to
Spark requires TableCatalog, not ViewCatalog. View catalogs are likely to
be just as specialized. Say I have a way to convert Pig statements or some
other representation into a SQL view. It would make little sense to combine
that with some other TableCatalog.

I also don’t see a benefit from an API perspective that would justify
combining the Table and View interfaces. The two share only schema and
properties, and are handled very differently internally: a View’s SQL
query is parsed and substituted into the plan, while a Table is wrapped in
a relation that eventually becomes a Scan node using SupportsRead. A view’s
SQL also needs additional context to be resolved correctly: the current
catalog and namespace from the time the view was created.
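
A rough sketch of how small the overlap is, using simplified stand-in traits. These are hypothetical and deliberately not the actual Spark connector interfaces; only the split between "schema and properties" and "SQL plus creation-time context" is taken from the description above.

```scala
// Hypothetical, simplified stand-ins; NOT the actual Spark connector API.
object SeparateInterfaces {
  final case class Column(name: String, dataType: String)

  // The only overlap between the two: schema and properties.
  trait HasSchemaAndProperties {
    def schema: Seq[Column]
    def properties: Map[String, String]
  }

  // A table is something that is eventually scanned.
  trait Table extends HasSchemaAndProperties {
    def name: String
  }

  // A view carries SQL text plus the context captured at creation time.
  trait View extends HasSchemaAndProperties {
    def sql: String
    def currentCatalog: String
    def currentNamespace: Seq[String]
  }

  final case class SimpleTable(
      name: String,
      schema: Seq[Column],
      properties: Map[String, String]) extends Table

  final case class SimpleView(
      sql: String,
      currentCatalog: String,
      currentNamespace: Seq[String],
      schema: Seq[Column],
      properties: Map[String, String]) extends View

  // An unqualified name inside the view's SQL resolves against the stored
  // context, not against whatever the session's current catalog happens to be.
  def qualify(view: View, unqualifiedName: String): Seq[String] =
    (view.currentCatalog +: view.currentNamespace) :+ unqualifiedName
}
```

For example, a view created while the current catalog was prod and the namespace was db would resolve an unqualified events to prod.db.events, regardless of where the query that reads the view is run.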

Query planning is distinct for tables and views, so Spark doesn’t benefit
from combining them. I think resolving both with the same method in v1
actually caused problems: the resolution rule grew extremely complicated
trying to look up a reference just once, because it had to parse a view
plan and resolve relations within it using the view’s context (current
database). In contrast, John’s new view substitution rules are cleaner and
can stay within the substitution batch.
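
To show the shape of view substitution as a separate step, here is a toy sketch over a two-node plan type. It is not John’s actual rules and not Spark’s plan classes: the Plan ADT, ViewDef, and the convention that a single-part name is unqualified are all assumptions made for illustration, and the view is stored as an already-parsed plan rather than SQL text.

```scala
// Toy model with hypothetical types; not Spark's LogicalPlan or real rules.
object ViewSubstitution {
  sealed trait Plan
  final case class UnresolvedRelation(nameParts: Seq[String]) extends Plan
  final case class Project(columns: Seq[String], child: Plan) extends Plan

  // A view: its (pre-parsed) plan plus the context captured at creation.
  final case class ViewDef(
      plan: Plan,
      currentCatalog: String,
      currentNamespace: Seq[String])

  // Qualify single-part names with the view's own catalog and namespace.
  def qualify(plan: Plan, catalog: String, namespace: Seq[String]): Plan =
    plan match {
      case UnresolvedRelation(Seq(name)) =>
        UnresolvedRelation((catalog +: namespace) :+ name)
      case r: UnresolvedRelation => r
      case Project(cols, child) =>
        Project(cols, qualify(child, catalog, namespace))
    }

  // Replace references to views with the view's plan; anything that is not
  // a view is left untouched for table resolution to handle later.
  // No cycle detection: a self-referencing view would recurse forever.
  def substituteViews(plan: Plan, views: Map[Seq[String], ViewDef]): Plan =
    plan match {
      case UnresolvedRelation(parts) if views.contains(parts) =>
        val v = views(parts)
        substituteViews(
          qualify(v.plan, v.currentCatalog, v.currentNamespace), views)
      case r: UnresolvedRelation => r
      case Project(cols, child) =>
        Project(cols, substituteViews(child, views))
    }
}
```

The point of the sketch is that substitution never needs to know how tables are loaded; it only rewrites the plan and hands the remaining relations on.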

People implementing views would also not benefit from combining the two
interfaces:

   - There is little overlap between View and Table, only schema and
   properties
   - Most catalogs won’t implement both interfaces, so returning a
   ViewOrTable is more difficult for implementations
   - TableCatalog assumes that ViewCatalog will be added separately like
   John proposes, so we would have to break or replace that API
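
The points above can be sketched with two small catalog traits. The trait names match the ones discussed in this thread, but the signatures are simplified assumptions, and checking views before tables in resolve is a choice made for this sketch rather than a statement about Spark’s behavior.

```scala
// Simplified, hypothetical signatures; not the real connector interfaces.
object CatalogMixins {
  final case class Identifier(namespace: Seq[String], name: String)
  final case class Table(name: String)
  final case class View(sql: String)

  trait TableCatalog { def loadTable(ident: Identifier): Option[Table] }
  trait ViewCatalog { def loadView(ident: Identifier): Option[View] }

  // A data-source catalog (think JDBC) implements only what it has.
  final class TablesOnlyCatalog(tables: Map[Identifier, Table])
      extends TableCatalog {
    def loadTable(ident: Identifier): Option[Table] = tables.get(ident)
  }

  // A view-only catalog, e.g. one that generates SQL views from another
  // representation.
  final class ViewsOnlyCatalog(views: Map[Identifier, View])
      extends ViewCatalog {
    def loadView(ident: Identifier): Option[View] = views.get(ident)
  }

  // The caller combines whatever the catalog supports. With a merged
  // interface, every implementation would have to produce this sum type.
  def resolve(catalog: AnyRef, ident: Identifier): Option[Either[View, Table]] = {
    val fromViews: Option[Either[View, Table]] = catalog match {
      case vc: ViewCatalog => vc.loadView(ident).map(v => Left(v))
      case _ => None
    }
    def fromTables: Option[Either[View, Table]] = catalog match {
      case tc: TableCatalog => tc.loadTable(ident).map(t => Right(t))
      case _ => None
    }
    fromViews.orElse(fromTables)
  }
}
```

With separate traits, the tables-only catalog never has to mention views at all, and the "view or table" decision lives in one place on the caller’s side.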

I understand the initial appeal of combining TableCatalog and ViewCatalog
since it is done that way in the existing interfaces. But I think that Hive
chose to do that mostly because the two were already stored together, and
not because it made sense for users of the API or for any other
implementer of the API.

rb

On Tue, Aug 18, 2020 at 9:46 AM John Zhuge <jzhuge@apache.org> wrote:

>
>
>
>> > AFAIK view schema is only used by DESCRIBE.
>>
>> Correction: Spark adds a new Project at the top of the parsed plan from
>> view, based on the stored schema, to make sure the view schema doesn't
>> change.
>>
>
> Thanks Wenchen! I thought I forgot something :) Yes it is the validation
> done in *checkAnalysis*:
>
>           // If the view output doesn't have the same number of columns neither with the
>           // child output, nor with the query column names, throw an AnalysisException.
>           // If the view's child output can't up cast to the view output,
>           // throw an AnalysisException, too.
>
> The view output comes from the schema:
>
>       val child = View(
>         desc = metadata,
>         output = metadata.schema.toAttributes,
>         child = parser.parsePlan(viewText))
>
> So it is a validation (here) or a cache (in DESCRIBE): nice to have, but
> not "required" or "should be frozen". Thanks Ryan and Burak for pointing
> that out in the SPIP. I will add a new paragraph accordingly.
>


-- 
Ryan Blue
Software Engineer
Netflix
