spark-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Wenchen Fan <cloud0...@gmail.com>
Subject Re: SPIP: Catalog API for view metadata
Date Tue, 10 Nov 2020 04:18:43 GMT
Moving back the discussion to this thread. The current argument is how to
avoid extra RPC calls for catalogs supporting both table and view. There
are several options:
1. ignore it as extra PRC calls are cheap compared to the query execution
2. have a per session cache for loaded table/view
3. have a per query cache for loaded table/view
4. add a new trait TableViewCatalog

I think it's important to avoid perf regression with new APIs. RPC calls
can be significant for short queries. We may also double the RPC
traffic which is bad for the metastore service. Normally I would not
recommend caching as cache invalidation is a hard problem. Personally I
prefer option 4 as it only affects catalogs that support both table and
view, and it fits the hive catalog very well.

On Fri, Sep 4, 2020 at 4:21 PM John Zhuge <jzhuge@apache.org> wrote:

> SPIP
> <https://docs.google.com/document/d/1XOxFtloiMuW24iqJ-zJnDzHl2KMxipTjJoxleJFz66A/edit?usp=sharing>
> has been updated. Please review.
>
> On Thu, Sep 3, 2020 at 9:22 AM John Zhuge <jzhuge@apache.org> wrote:
>
>> Wenchen, sorry for the delay, I will post an update shortly.
>>
>> On Thu, Sep 3, 2020 at 2:00 AM Wenchen Fan <cloud0fan@gmail.com> wrote:
>>
>>> Any updates here? I agree that a new View API is better, but we need a
>>> solution to avoid performance regression. We need to elaborate on the cache
>>> idea.
>>>
>>> On Thu, Aug 20, 2020 at 7:43 AM Ryan Blue <rblue@netflix.com> wrote:
>>>
>>>> I think it is a good idea to keep tables and views separate.
>>>>
>>>> The main two arguments I’ve heard for combining lookup into a single
>>>> function are the ones brought up in this thread. First, an identifier in
a
>>>> catalog must be either a view or a table and should not collide. Second,
a
>>>> single lookup is more likely to require a single RPC. I think the RPC
>>>> concern is well addressed by caching, which we already do in the Spark
>>>> catalog, so I’ll primarily focus on the first.
>>>>
>>>> Table/view name collision is unlikely to be a problem. Metastores that
>>>> support both today store them in a single namespace, so this is not a
>>>> concern for even a naive implementation that talks to the Hive MetaStore.
I
>>>> know that a new metastore catalog could choose to implement both
>>>> ViewCatalog and TableCatalog and store the two sets separately, but that
>>>> would be a very strange choice: if the metastore itself has different
>>>> namespaces for tables and views, then it makes much more sense to expose
>>>> them through separate catalogs because Spark will always prefer one over
>>>> the other.
>>>>
>>>> In a similar line of reasoning, catalogs that expose both views and
>>>> tables are much more rare than catalogs that only expose one. For example,
>>>> v2 catalogs for JDBC and Cassandra expose data through the Table interface
>>>> and implementing ViewCatalog would make little sense. Exposing new data
>>>> sources to Spark requires TableCatalog, not ViewCatalog. View catalogs are
>>>> likely to be the same. Say I have a way to convert Pig statements or some
>>>> other representation into a SQL view. It would make little sense to combine
>>>> that with some other TableCatalog.
>>>>
>>>> I also don’t think there is benefit from an API perspective to justify
>>>> combining the Table and View interfaces. The two share only schema and
>>>> properties, and are handled very differently internally — a View’s SQL
>>>> query is parsed and substituted into the plan, while a Table is wrapped in
>>>> a relation that eventually becomes a Scan node using SupportsRead. A view’s
>>>> SQL also needs additional context to be resolved correctly: the current
>>>> catalog and namespace from the time the view was created.
>>>>
>>>> Query planning is distinct between tables and views, so Spark doesn’t
>>>> benefit from combining them. I think it has actually caused problems that
>>>> both were resolved by the same method in v1: the resolution rule grew
>>>> extremely complicated trying to look up a reference just once because it
>>>> had to parse a view plan and resolve relations within it using the view’s
>>>> context (current database). In contrast, John’s new view substitution rules
>>>> are cleaner and can stay within the substitution batch.
>>>>
>>>> People implementing views would also not benefit from combining the two
>>>> interfaces:
>>>>
>>>>    - There is little overlap between View and Table, only schema and
>>>>    properties
>>>>    - Most catalogs won’t implement both interfaces, so returning a
>>>>    ViewOrTable is more difficult for implementations
>>>>    - TableCatalog assumes that ViewCatalog will be added separately
>>>>    like John proposes, so we would have to break or replace that API
>>>>
>>>> I understand the initial appeal of combining TableCatalog and
>>>> ViewCatalog since it is done that way in the existing interfaces. But I
>>>> think that Hive chose to do that mostly on the fact that the two were
>>>> already stored together, and not because it made sense for users of the
>>>> API, or any other implementer of the API.
>>>>
>>>> rb
>>>>
>>>> On Tue, Aug 18, 2020 at 9:46 AM John Zhuge <jzhuge@apache.org> wrote:
>>>>
>>>>>
>>>>>
>>>>>
>>>>>> > AFAIK view schema is only used by DESCRIBE.
>>>>>>
>>>>>> Correction: Spark adds a new Project at the top of the parsed plan
>>>>>> from view, based on the stored schema, to make sure the view schema
doesn't
>>>>>> change.
>>>>>>
>>>>>
>>>>> Thanks Wenchen! I thought I forgot something :) Yes it is the
>>>>> validation done in *checkAnalysis*:
>>>>>
>>>>>           // If the view output doesn't have the same number of
>>>>> columns neither with the child
>>>>>           // output, nor with the query column names, throw an
>>>>> AnalysisException.
>>>>>           // If the view's child output can't up cast to the view
>>>>> output,
>>>>>           // throw an AnalysisException, too.
>>>>>
>>>>> The view output comes from the schema:
>>>>>
>>>>>       val child = View(
>>>>>         desc = metadata,
>>>>>         output = metadata.schema.toAttributes,
>>>>>         child = parser.parsePlan(viewText))
>>>>>
>>>>> So it is a validation (here) or cache (in DESCRIBE) nice to have but
>>>>> not "required" or "should be frozen". Thanks Ryan and Burak for pointing
>>>>> that out in SPIP. I will add a new paragraph accordingly.
>>>>>
>>>>
>>>>
>>>> --
>>>> Ryan Blue
>>>> Software Engineer
>>>> Netflix
>>>>
>>>
>>
>> --
>> John Zhuge
>>
>
>
> --
> John Zhuge
>

Mime
View raw message