spark-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From John Zhuge <jzh...@apache.org>
Subject Re: SPIP: Catalog API for view metadata
Date Fri, 04 Sep 2020 08:20:55 GMT
SPIP
<https://docs.google.com/document/d/1XOxFtloiMuW24iqJ-zJnDzHl2KMxipTjJoxleJFz66A/edit?usp=sharing>
has been updated. Please review.

On Thu, Sep 3, 2020 at 9:22 AM John Zhuge <jzhuge@apache.org> wrote:

> Wenchen, sorry for the delay, I will post an update shortly.
>
> On Thu, Sep 3, 2020 at 2:00 AM Wenchen Fan <cloud0fan@gmail.com> wrote:
>
>> Any updates here? I agree that a new View API is better, but we need a
>> solution to avoid performance regression. We need to elaborate on the cache
>> idea.
>>
>> On Thu, Aug 20, 2020 at 7:43 AM Ryan Blue <rblue@netflix.com> wrote:
>>
>>> I think it is a good idea to keep tables and views separate.
>>>
>>> The main two arguments I’ve heard for combining lookup into a single
>>> function are the ones brought up in this thread. First, an identifier in a
>>> catalog must be either a view or a table and should not collide. Second, a
>>> single lookup is more likely to require a single RPC. I think the RPC
>>> concern is well addressed by caching, which we already do in the Spark
>>> catalog, so I’ll primarily focus on the first.
>>>
>>> Table/view name collision is unlikely to be a problem. Metastores that
>>> support both today store them in a single namespace, so this is not a
>>> concern for even a naive implementation that talks to the Hive MetaStore. I
>>> know that a new metastore catalog could choose to implement both
>>> ViewCatalog and TableCatalog and store the two sets separately, but that
>>> would be a very strange choice: if the metastore itself has different
>>> namespaces for tables and views, then it makes much more sense to expose
>>> them through separate catalogs because Spark will always prefer one over
>>> the other.
>>>
>>> In a similar line of reasoning, catalogs that expose both views and
>>> tables are much more rare than catalogs that only expose one. For example,
>>> v2 catalogs for JDBC and Cassandra expose data through the Table interface
>>> and implementing ViewCatalog would make little sense. Exposing new data
>>> sources to Spark requires TableCatalog, not ViewCatalog. View catalogs are
>>> likely to be the same. Say I have a way to convert Pig statements or some
>>> other representation into a SQL view. It would make little sense to combine
>>> that with some other TableCatalog.
>>>
>>> I also don’t think there is benefit from an API perspective to justify
>>> combining the Table and View interfaces. The two share only schema and
>>> properties, and are handled very differently internally — a View’s SQL
>>> query is parsed and substituted into the plan, while a Table is wrapped in
>>> a relation that eventually becomes a Scan node using SupportsRead. A view’s
>>> SQL also needs additional context to be resolved correctly: the current
>>> catalog and namespace from the time the view was created.
>>>
>>> Query planning is distinct between tables and views, so Spark doesn’t
>>> benefit from combining them. I think it has actually caused problems that
>>> both were resolved by the same method in v1: the resolution rule grew
>>> extremely complicated trying to look up a reference just once because it
>>> had to parse a view plan and resolve relations within it using the view’s
>>> context (current database). In contrast, John’s new view substitution rules
>>> are cleaner and can stay within the substitution batch.
>>>
>>> People implementing views would also not benefit from combining the two
>>> interfaces:
>>>
>>>    - There is little overlap between View and Table, only schema and
>>>    properties
>>>    - Most catalogs won’t implement both interfaces, so returning a
>>>    ViewOrTable is more difficult for implementations
>>>    - TableCatalog assumes that ViewCatalog will be added separately
>>>    like John proposes, so we would have to break or replace that API
>>>
>>> I understand the initial appeal of combining TableCatalog and
>>> ViewCatalog since it is done that way in the existing interfaces. But I
>>> think that Hive chose to do that mostly on the fact that the two were
>>> already stored together, and not because it made sense for users of the
>>> API, or any other implementer of the API.
>>>
>>> rb
>>>
>>> On Tue, Aug 18, 2020 at 9:46 AM John Zhuge <jzhuge@apache.org> wrote:
>>>
>>>>
>>>>
>>>>
>>>>> > AFAIK view schema is only used by DESCRIBE.
>>>>>
>>>>> Correction: Spark adds a new Project at the top of the parsed plan
>>>>> from view, based on the stored schema, to make sure the view schema doesn't
>>>>> change.
>>>>>
>>>>
>>>> Thanks Wenchen! I thought I forgot something :) Yes it is the
>>>> validation done in *checkAnalysis*:
>>>>
>>>>           // If the view output doesn't have the same number of columns
>>>> neither with the child
>>>>           // output, nor with the query column names, throw an
>>>> AnalysisException.
>>>>           // If the view's child output can't up cast to the view
>>>> output,
>>>>           // throw an AnalysisException, too.
>>>>
>>>> The view output comes from the schema:
>>>>
>>>>       val child = View(
>>>>         desc = metadata,
>>>>         output = metadata.schema.toAttributes,
>>>>         child = parser.parsePlan(viewText))
>>>>
>>>> So it is a validation (here) or cache (in DESCRIBE) nice to have but
>>>> not "required" or "should be frozen". Thanks Ryan and Burak for pointing
>>>> that out in SPIP. I will add a new paragraph accordingly.
>>>>
>>>
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>>>
>>
>
> --
> John Zhuge
>


-- 
John Zhuge

Mime
View raw message