spark-dev mailing list archives

From Ryan Blue <rb...@netflix.com.INVALID>
Subject Re: SQL DDL statements with replacing default catalog with custom catalog
Date Wed, 07 Oct 2020 16:30:33 GMT
I disagree that this is “by design”. An operation like DROP TABLE should
use a v2 drop plan if the table is v2.

If a v2 table is loaded or created using a v2 catalog it should also be
dropped that way. Otherwise, the v2 catalog is not notified when the table
is dropped and can’t perform other necessary updates, like invalidating
caches or dropping state outside of Hive. V2 tables should always use the
v2 API, and I’m not aware of a design where that wasn’t the case.
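To make the rule above concrete, here is a minimal sketch that contrasts the two ways of choosing a drop path. It is a toy model and not Spark's analyzer code: `Table`, `is_v1`, `plan_by_catalog_name`, and `plan_by_table_type` are hypothetical names invented for illustration. The first function approximates the behavior reported in this thread, where an identifier that resolves to the session catalog is sent down the v1 path. The second follows the rule that a v2 table always takes the v2 path.

```python
from dataclasses import dataclass

# Name of the built-in session catalog, as mentioned in the thread.
SESSION_CATALOG_NAME = "spark_catalog"


@dataclass(frozen=True)
class Table:
    """Hypothetical stand-in for a resolved table; not a Spark class."""
    name: str
    is_v1: bool  # True only for tables the v1 session catalog owns


def plan_by_catalog_name(catalog_name: str, table: Table) -> str:
    """Approximates the reported behavior: the catalog name alone decides.

    The table argument is deliberately ignored, which is the problem.
    """
    return "v1" if catalog_name == SESSION_CATALOG_NAME else "v2"


def plan_by_table_type(catalog_name: str, table: Table) -> str:
    """The rule argued for above: the table's own type decides."""
    return "v1" if table.is_v1 else "v2"


# A v2 table served by a custom catalog registered as 'spark_catalog'.
custom = Table("db.events", is_v1=False)
print(plan_by_catalog_name(SESSION_CATALOG_NAME, custom))  # v1
print(plan_by_table_type(SESSION_CATALOG_NAME, custom))    # v2
```

With the name-based choice, the custom catalog never sees the drop. With the type-based choice it does, at the cost of having to know the table's type before planning.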

I’d also say that for DROP TABLE in particular, all calls could use the v2
catalog. We may not want to do this until we are confident as Wenchen said,
but this would be the simpler solution. The v2 catalog can delegate to the
old session catalog, after all.
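As a rough illustration of that delegation, the sketch below wraps a toy session catalog in a toy v2 catalog. None of these classes are Spark interfaces: `V1SessionCatalog`, `DelegatingV2Catalog`, and their methods are simplified, hypothetical stand-ins, and the cache is just a dict. The point is only that a drop routed through the v2 catalog lets it clean up its own state before delegating, while a drop that bypasses it leaves that state stale.

```python
class V1SessionCatalog:
    """Toy stand-in for the built-in session catalog (assumption)."""

    def __init__(self):
        self.tables = {}

    def drop_table(self, ident):
        # Returns True if the table existed and was removed.
        if ident in self.tables:
            del self.tables[ident]
            return True
        return False


class DelegatingV2Catalog:
    """Hypothetical v2 catalog that keeps a cache and delegates storage."""

    def __init__(self, delegate):
        self.delegate = delegate
        self.cache = {}

    def load_table(self, ident):
        # Populate the cache from the delegate on first access.
        if ident not in self.cache and ident in self.delegate.tables:
            self.cache[ident] = self.delegate.tables[ident]
        return self.cache.get(ident)

    def drop_table(self, ident):
        # Invalidate our own state first, then delegate the actual drop.
        self.cache.pop(ident, None)
        return self.delegate.drop_table(ident)


session = V1SessionCatalog()
session.tables[("db", "a")] = {"provider": "custom"}
session.tables[("db", "b")] = {"provider": "custom"}
catalog = DelegatingV2Catalog(session)
catalog.load_table(("db", "a"))
catalog.load_table(("db", "b"))

# Dropped through the v2 catalog: cache entry is invalidated.
catalog.drop_table(("db", "a"))
# Dropped behind the v2 catalog's back: cache entry goes stale.
session.drop_table(("db", "b"))

print(("db", "a") in catalog.cache)  # False
print(("db", "b") in catalog.cache)  # True
```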

On Wed, Oct 7, 2020 at 3:48 AM Wenchen Fan <cloud0fan@gmail.com> wrote:

> If you just want to save typing the catalog name when writing table names,
> you can set your custom catalog as the default catalog (See
> SQLConf.DEFAULT_CATALOG). SQLConf.V2_SESSION_CATALOG_IMPLEMENTATION is
> used to extend the v1 session catalog, not replace it.
>
> On Wed, Oct 7, 2020 at 5:36 PM Jungtaek Lim <kabhwan.opensource@gmail.com>
> wrote:
>
>> If it's by design and not prepared, then IMHO replacing the default
>> session catalog is better to be restricted until things are sorted out, as
>> it gives pretty much confusion and has known bugs. Actually there's another
>> bug/limitation on default session catalog on the length of identifier,
>> so things that work with custom catalog no longer work when it replaces
>> default session catalog.
>>
>> On Wed, Oct 7, 2020 at 6:05 PM Wenchen Fan <cloud0fan@gmail.com> wrote:
>>
>>> Ah, this is by design. V1 tables should still go through the v1 session
>>> catalog. I think we can remove this restriction when we are confident about
>>> the new v2 DDL commands that work with v2 catalog APIs.
>>>
>>> On Wed, Oct 7, 2020 at 5:00 PM Jungtaek Lim <
>>> kabhwan.opensource@gmail.com> wrote:
>>>
>>>> My case is DROP TABLE and DROP TABLE supports both v1 and v2 (as it
>>>> simply works when I use custom catalog without replacing the default
>>>> catalog).
>>>>
>>>> It just fails on v2 when the "default catalog" is replaced (say I
>>>> replace 'spark_catalog'), because TempViewOrV1Table is providing value even
>>>> with v2 table, and then the catalyst goes with v1 exec. I guess all
>>>> commands leveraging TempViewOrV1Table to determine whether the table is v1
>>>> vs v2 would all suffer from this issue.
>>>>
>>>> On Wed, Oct 7, 2020 at 5:45 PM Wenchen Fan <cloud0fan@gmail.com> wrote:
>>>>
>>>>> Not all the DDL commands support v2 catalog APIs (e.g. CREATE TABLE
>>>>> LIKE), so it's possible that some commands still go through the v1 session
>>>>> catalog although you configured a custom v2 session catalog.
>>>>>
>>>>> Can you create JIRA tickets if you hit any DDL commands that don't
>>>>> support v2 catalog? We should fix them.
>>>>>
>>>>> On Wed, Oct 7, 2020 at 9:15 AM Jungtaek Lim <
>>>>> kabhwan.opensource@gmail.com> wrote:
>>>>>
>>>>>> The logical plan for the parsed statement is getting converted either
>>>>>> for old one or v2, and for the former one it keeps using an external
>>>>>> catalog (Hive) - so replacing default session catalog with custom one and
>>>>>> trying to use it like it is in external catalog doesn't work, which
>>>>>> destroys the purpose of replacing the default session catalog.
>>>>>>
>>>>>> Btw I see one approach: in TempViewOrV1Table, if it matches
>>>>>> with SessionCatalogAndIdentifier where the catalog is TableCatalog, call
>>>>>> loadTable in catalog and see whether it's V1 table or not. Not sure it's a
>>>>>> viable approach though, as it requires loading a table during resolution of
>>>>>> the table identifier.
>>>>>>
>>>>>> On Wed, Oct 7, 2020 at 10:04 AM Ryan Blue <rblue@netflix.com> wrote:
>>>>>>
>>>>>>> I've hit this with `DROP TABLE` commands that should be passed to a
>>>>>>> registered v2 session catalog, but are handled by v1. I think that's the
>>>>>>> only case we hit in our downstream test suites, but we haven't been
>>>>>>> exploring the use of a session catalog for fallback. We use v2 for
>>>>>>> everything now, which avoids the problem and comes with multi-catalog
>>>>>>> support.
>>>>>>>
>>>>>>> On Tue, Oct 6, 2020 at 5:55 PM Jungtaek Lim <
>>>>>>> kabhwan.opensource@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi devs,
>>>>>>>>
>>>>>>>> I'm not sure whether it's addressed in Spark 3.1, but at least from
>>>>>>>> Spark 3.0.1, many SQL DDL statements don't seem to go through the custom
>>>>>>>> catalog when I replace default catalog with custom catalog and only provide
>>>>>>>> 'dbName.tableName' as table identifier.
>>>>>>>>
>>>>>>>> I'm not an expert in this area, but after skimming the code I feel
>>>>>>>> TempViewOrV1Table looks to be broken for the case, as it can still be a V2
>>>>>>>> table. Classifying the table identifier to either V2 table or "temp view or
>>>>>>>> v1 table" looks to be mandatory, as former and latter have different code
>>>>>>>> paths and different catalog interfaces.
>>>>>>>>
>>>>>>>> That sounds to me as being stuck and the only "clear" approach
>>>>>>>> seems to disallow default catalog with custom one. Am I missing something?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Jungtaek Lim (HeartSaVioR)
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Ryan Blue
>>>>>>> Software Engineer
>>>>>>> Netflix
>>>>>>>
>>>>>>

-- 
Ryan Blue
Software Engineer
Netflix
