spark-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Ryan Blue <rb...@netflix.com.INVALID>
Subject Re: SPIP: Catalog API for view metadata
Date Wed, 11 Nov 2020 01:21:52 GMT
An extra RPC call is a concern for the catalog implementation. It is simple
to cache the result of a call to avoid a second one if the catalog chooses.

I don't think that an extra RPC that can be easily avoided is a reasonable
justification to add caches in Spark. For one thing, it doesn't solve the
problem because the proposed API still requires separate lookups for tables
and views.

The only solution that would help is to use a combined trait, but that has
issues. For one, view substitution is much cleaner when it happens well
before table resolution. And, View and Table are very different objects;
returning Object from this API doesn't make much sense.

One extra RPC is not unreasonable, and the choice should be left to
sources. That's the easiest place to cache results from the underlying
store.

On Mon, Nov 9, 2020 at 8:18 PM Wenchen Fan <cloud0fan@gmail.com> wrote:

> Moving back the discussion to this thread. The current argument is how to
> avoid extra RPC calls for catalogs supporting both table and view. There
> are several options:
> 1. ignore it as extra PRC calls are cheap compared to the query execution
> 2. have a per session cache for loaded table/view
> 3. have a per query cache for loaded table/view
> 4. add a new trait TableViewCatalog
>
> I think it's important to avoid perf regression with new APIs. RPC calls
> can be significant for short queries. We may also double the RPC
> traffic which is bad for the metastore service. Normally I would not
> recommend caching as cache invalidation is a hard problem. Personally I
> prefer option 4 as it only affects catalogs that support both table and
> view, and it fits the hive catalog very well.
>
> On Fri, Sep 4, 2020 at 4:21 PM John Zhuge <jzhuge@apache.org> wrote:
>
>> SPIP
>> <https://docs.google.com/document/d/1XOxFtloiMuW24iqJ-zJnDzHl2KMxipTjJoxleJFz66A/edit?usp=sharing>
>> has been updated. Please review.
>>
>> On Thu, Sep 3, 2020 at 9:22 AM John Zhuge <jzhuge@apache.org> wrote:
>>
>>> Wenchen, sorry for the delay, I will post an update shortly.
>>>
>>> On Thu, Sep 3, 2020 at 2:00 AM Wenchen Fan <cloud0fan@gmail.com> wrote:
>>>
>>>> Any updates here? I agree that a new View API is better, but we need a
>>>> solution to avoid performance regression. We need to elaborate on the cache
>>>> idea.
>>>>
>>>> On Thu, Aug 20, 2020 at 7:43 AM Ryan Blue <rblue@netflix.com> wrote:
>>>>
>>>>> I think it is a good idea to keep tables and views separate.
>>>>>
>>>>> The main two arguments I’ve heard for combining lookup into a single
>>>>> function are the ones brought up in this thread. First, an identifier
in a
>>>>> catalog must be either a view or a table and should not collide. Second,
a
>>>>> single lookup is more likely to require a single RPC. I think the RPC
>>>>> concern is well addressed by caching, which we already do in the Spark
>>>>> catalog, so I’ll primarily focus on the first.
>>>>>
>>>>> Table/view name collision is unlikely to be a problem. Metastores that
>>>>> support both today store them in a single namespace, so this is not a
>>>>> concern for even a naive implementation that talks to the Hive MetaStore.
I
>>>>> know that a new metastore catalog could choose to implement both
>>>>> ViewCatalog and TableCatalog and store the two sets separately, but that
>>>>> would be a very strange choice: if the metastore itself has different
>>>>> namespaces for tables and views, then it makes much more sense to expose
>>>>> them through separate catalogs because Spark will always prefer one over
>>>>> the other.
>>>>>
>>>>> In a similar line of reasoning, catalogs that expose both views and
>>>>> tables are much more rare than catalogs that only expose one. For example,
>>>>> v2 catalogs for JDBC and Cassandra expose data through the Table interface
>>>>> and implementing ViewCatalog would make little sense. Exposing new data
>>>>> sources to Spark requires TableCatalog, not ViewCatalog. View catalogs
are
>>>>> likely to be the same. Say I have a way to convert Pig statements or
some
>>>>> other representation into a SQL view. It would make little sense to combine
>>>>> that with some other TableCatalog.
>>>>>
>>>>> I also don’t think there is benefit from an API perspective to justify
>>>>> combining the Table and View interfaces. The two share only schema and
>>>>> properties, and are handled very differently internally — a View’s
SQL
>>>>> query is parsed and substituted into the plan, while a Table is wrapped
in
>>>>> a relation that eventually becomes a Scan node using SupportsRead. A
view’s
>>>>> SQL also needs additional context to be resolved correctly: the current
>>>>> catalog and namespace from the time the view was created.
>>>>>
>>>>> Query planning is distinct between tables and views, so Spark doesn’t
>>>>> benefit from combining them. I think it has actually caused problems
that
>>>>> both were resolved by the same method in v1: the resolution rule grew
>>>>> extremely complicated trying to look up a reference just once because
it
>>>>> had to parse a view plan and resolve relations within it using the view’s
>>>>> context (current database). In contrast, John’s new view substitution
rules
>>>>> are cleaner and can stay within the substitution batch.
>>>>>
>>>>> People implementing views would also not benefit from combining the
>>>>> two interfaces:
>>>>>
>>>>>    - There is little overlap between View and Table, only schema and
>>>>>    properties
>>>>>    - Most catalogs won’t implement both interfaces, so returning a
>>>>>    ViewOrTable is more difficult for implementations
>>>>>    - TableCatalog assumes that ViewCatalog will be added separately
>>>>>    like John proposes, so we would have to break or replace that API
>>>>>
>>>>> I understand the initial appeal of combining TableCatalog and
>>>>> ViewCatalog since it is done that way in the existing interfaces. But
I
>>>>> think that Hive chose to do that mostly on the fact that the two were
>>>>> already stored together, and not because it made sense for users of the
>>>>> API, or any other implementer of the API.
>>>>>
>>>>> rb
>>>>>
>>>>> On Tue, Aug 18, 2020 at 9:46 AM John Zhuge <jzhuge@apache.org>
wrote:
>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>> > AFAIK view schema is only used by DESCRIBE.
>>>>>>>
>>>>>>> Correction: Spark adds a new Project at the top of the parsed
plan
>>>>>>> from view, based on the stored schema, to make sure the view
schema doesn't
>>>>>>> change.
>>>>>>>
>>>>>>
>>>>>> Thanks Wenchen! I thought I forgot something :) Yes it is the
>>>>>> validation done in *checkAnalysis*:
>>>>>>
>>>>>>           // If the view output doesn't have the same number of
>>>>>> columns neither with the child
>>>>>>           // output, nor with the query column names, throw an
>>>>>> AnalysisException.
>>>>>>           // If the view's child output can't up cast to the view
>>>>>> output,
>>>>>>           // throw an AnalysisException, too.
>>>>>>
>>>>>> The view output comes from the schema:
>>>>>>
>>>>>>       val child = View(
>>>>>>         desc = metadata,
>>>>>>         output = metadata.schema.toAttributes,
>>>>>>         child = parser.parsePlan(viewText))
>>>>>>
>>>>>> So it is a validation (here) or cache (in DESCRIBE) nice to have
but
>>>>>> not "required" or "should be frozen". Thanks Ryan and Burak for pointing
>>>>>> that out in SPIP. I will add a new paragraph accordingly.
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>> Software Engineer
>>>>> Netflix
>>>>>
>>>>
>>>
>>> --
>>> John Zhuge
>>>
>>
>>
>> --
>> John Zhuge
>>
>

-- 
Ryan Blue
Software Engineer
Netflix

Mime
View raw message