spark-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From John Zhuge <jzh...@apache.org>
Subject Re: SPIP: Catalog API for view metadata
Date Thu, 03 Sep 2020 16:22:04 GMT
Wenchen, sorry for the delay, I will post an update shortly.

On Thu, Sep 3, 2020 at 2:00 AM Wenchen Fan <cloud0fan@gmail.com> wrote:

> Any updates here? I agree that a new View API is better, but we need a
> solution to avoid performance regression. We need to elaborate on the cache
> idea.
>
> On Thu, Aug 20, 2020 at 7:43 AM Ryan Blue <rblue@netflix.com> wrote:
>
>> I think it is a good idea to keep tables and views separate.
>>
>> The main two arguments I’ve heard for combining lookup into a single
>> function are the ones brought up in this thread. First, an identifier in a
>> catalog must be either a view or a table and should not collide. Second, a
>> single lookup is more likely to require a single RPC. I think the RPC
>> concern is well addressed by caching, which we already do in the Spark
>> catalog, so I’ll primarily focus on the first.
>>
>> Table/view name collision is unlikely to be a problem. Metastores that
>> support both today store them in a single namespace, so this is not a
>> concern for even a naive implementation that talks to the Hive MetaStore. I
>> know that a new metastore catalog could choose to implement both
>> ViewCatalog and TableCatalog and store the two sets separately, but that
>> would be a very strange choice: if the metastore itself has different
>> namespaces for tables and views, then it makes much more sense to expose
>> them through separate catalogs because Spark will always prefer one over
>> the other.
>>
>> In a similar line of reasoning, catalogs that expose both views and
>> tables are much more rare than catalogs that only expose one. For example,
>> v2 catalogs for JDBC and Cassandra expose data through the Table interface
>> and implementing ViewCatalog would make little sense. Exposing new data
>> sources to Spark requires TableCatalog, not ViewCatalog. View catalogs are
>> likely to be the same. Say I have a way to convert Pig statements or some
>> other representation into a SQL view. It would make little sense to combine
>> that with some other TableCatalog.
>>
>> I also don’t think there is benefit from an API perspective to justify
>> combining the Table and View interfaces. The two share only schema and
>> properties, and are handled very differently internally — a View’s SQL
>> query is parsed and substituted into the plan, while a Table is wrapped in
>> a relation that eventually becomes a Scan node using SupportsRead. A view’s
>> SQL also needs additional context to be resolved correctly: the current
>> catalog and namespace from the time the view was created.
>>
>> Query planning is distinct between tables and views, so Spark doesn’t
>> benefit from combining them. I think it has actually caused problems that
>> both were resolved by the same method in v1: the resolution rule grew
>> extremely complicated trying to look up a reference just once because it
>> had to parse a view plan and resolve relations within it using the view’s
>> context (current database). In contrast, John’s new view substitution rules
>> are cleaner and can stay within the substitution batch.
>>
>> People implementing views would also not benefit from combining the two
>> interfaces:
>>
>>    - There is little overlap between View and Table, only schema and
>>    properties
>>    - Most catalogs won’t implement both interfaces, so returning a
>>    ViewOrTable is more difficult for implementations
>>    - TableCatalog assumes that ViewCatalog will be added separately like
>>    John proposes, so we would have to break or replace that API
>>
>> I understand the initial appeal of combining TableCatalog and ViewCatalog
>> since it is done that way in the existing interfaces. But I think that Hive
>> chose to do that mostly on the fact that the two were already stored
>> together, and not because it made sense for users of the API, or any other
>> implementer of the API.
>>
>> rb
>>
>> On Tue, Aug 18, 2020 at 9:46 AM John Zhuge <jzhuge@apache.org> wrote:
>>
>>>
>>>
>>>
>>>> > AFAIK view schema is only used by DESCRIBE.
>>>>
>>>> Correction: Spark adds a new Project at the top of the parsed plan from
>>>> view, based on the stored schema, to make sure the view schema doesn't
>>>> change.
>>>>
>>>
>>> Thanks Wenchen! I thought I forgot something :) Yes it is the validation
>>> done in *checkAnalysis*:
>>>
>>>           // If the view output doesn't have the same number of columns
>>> neither with the child
>>>           // output, nor with the query column names, throw an
>>> AnalysisException.
>>>           // If the view's child output can't up cast to the view output,
>>>           // throw an AnalysisException, too.
>>>
>>> The view output comes from the schema:
>>>
>>>       val child = View(
>>>         desc = metadata,
>>>         output = metadata.schema.toAttributes,
>>>         child = parser.parsePlan(viewText))
>>>
>>> So it is a validation (here) or cache (in DESCRIBE) nice to have but not
>>> "required" or "should be frozen". Thanks Ryan and Burak for pointing that
>>> out in SPIP. I will add a new paragraph accordingly.
>>>
>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>

-- 
John Zhuge

Mime
View raw message