spark-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jungtaek Lim <>
Subject Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s
Date Fri, 20 Mar 2020 03:06:25 GMT
Anything would be OK if the create table DDL provides a "clear way" to
expect the table provider "before" they run the query. Great news that it
doesn't require major rework - looking forward to the PR.

Thanks again to jump in and sort this out.

- Jungtaek Lim (HeartSaVioR)

On Fri, Mar 20, 2020 at 9:36 AM Ryan Blue <> wrote:

> I have an update to the parser that unifies the CREATE TABLE rules. It
> took surprisingly little work to get the parser updated to produce
> CreateTableStatement and CreateTableAsSelectStatement with the Hive info.
> And the only fields I need to add to those statements were serde:
> SerdeInfo and external: Boolean.
> From there, we can use the conversion rules to re-create the same Hive
> command for v1 or pass the data as properties for v2. I’ll work on getting
> this cleaned up and open a PR hopefully tomorrow.
> For the questions about how this gets converted to either a Spark or Hive
> create table command, that is really up to analyzer rules and
> configuration. With my changes, it is no longer determined by the parser:
> the parser just produces a node that includes all of the user options and
> Spark decides what to do with that in the analyzer. Also, there's already
> an option to convert Hive syntax to a Spark
> command, spark.sql.hive.convertCTAS.
> rb
> On Thu, Mar 19, 2020 at 12:46 AM Wenchen Fan <> wrote:
>> Big +1 to have one single unified CREATE TABLE syntax.
>> In general, we can say there are 2 ways to specify the table provider:
>> USING clause and ROW FORMAT/STORED AS clause. These 2 ways are mutually
>> exclusive. If none is specified, it implicitly indicates USING
>> defaultSource.
>> I'm fine with a few special cases which can indicate the table provider,
>> like EXTERNAL indicates Hive Serde table. A few thoughts:
>> 1. SKEWED BY ...: We support it in Hive syntax just to fail it with a
>> nice error message. We can support it in the unified syntax as well, and
>> fail it.
>> 2. PARTITIONED BY colTypeList: I think we can support it in the unified
>> syntax. Just make sure it doesn't appear together with PARTITIONED BY
>> transformList.
>> 3. OPTIONS: We can either map it to Hive Serde properties, or let it
>> indicate non-Hive tables.
>> On Thu, Mar 19, 2020 at 1:00 PM Jungtaek Lim <
>>> wrote:
>>> Thanks Nicholas for the side comment; you'll need to interpret "CREATE
>>> may add the confusion.
>>> Ryan, thanks for the detailed analysis and proposal. That's what I would
>>> like to see in discussion thread.
>>> I'm open to solutions which enable end users to specify their intention
>>> properly - my main concern of SPARK-30098 is that it becomes unclear which
>>> provider the query will use in create table unless USING provider is
>>> explicitly specified. If the new proposal makes clear on this, that should
>>> be better than now.
>>> Replying inline:
>>> On Thu, Mar 19, 2020 at 11:06 AM Nicholas Chammas <
>>>> wrote:
>>>> Side comment: The current docs for CREATE TABLE
>>>> <>
>>>> add to the confusion by describing the Hive-compatible command as "CREATE
>>>> TABLE USING HIVE FORMAT", but neither "USING" nor "HIVE FORMAT" are
>>>> actually part of the syntax
>>>> <>
>>>> .
>>>> On Wed, Mar 18, 2020 at 8:31 PM Ryan Blue <>
>>>> wrote:
>>>>> Jungtaek, it sounds like you consider the two rules to be separate
>>>>> syntaxes with their own consistency rules. For example, if I am using
>>>>> Hive syntax rule, then the PARTITIONED BY clause adds new (partition)
>>>>> columns and requires types for those columns; if I’m using the Spark
>>>>> rule with USING then PARTITIONED BY must reference existing columns
>>>>> and cannot include types.
>>>>> I agree that this is confusing to users! We should fix it, but I don’t
>>>>> think the right solution is to continue to have two rules with divergent
>>>>> syntax.
>>>>> This is confusing to users because they don’t know anything about
>>>>> separate parser rules. All the user sees is that sometimes PARTITION
>>>>> BY requires types and sometimes it doesn’t. Yes, we could add a
>>>>> keyword, HIVE, to signal that the syntax is borrowed from Hive for
>>>>> that case, but that actually breaks queries that run in Hive.
>>>> That might less matter, because SPARK-30098 (and I guess your proposal
>>> as well) enforces end users to add "USING HIVE" for their queries to enable
>>> Hive provider in any way, even only when the query matches with rule 1
>>> (conditional). Once they decide to create Hive table, the query might have
>>> to be changed, or they have to change the default provider, or they have to
>>> enable legacy config.
>>>> I think the right solution is to unify the two syntaxes. I don’t think
>>>>> they are so different that it isn’t possible. Here are the differences
>>>>> see:
>>>>>    - Only in Hive:
>>>>>       - EXTERNAL
>>>>>       - skewSpec: SKEWED BY ...
>>>>>       - rowFormat: ROW FORMAT DELIMITED ..., ROW FORMAT SERDE ...
>>>>>       - createFileFormat: STORED AS ...
>>>>>    - Only in Spark:
>>>>>       - OPTIONS property list
>>>>>    - Different syntax/interpretation:
>>>>>       - PARTITIONED BY transformList / PARTITIONED BY colTypeList
>>>>> ":" after column name is another one only supported in Hive, though
>>> that's relatively minor to support it in unified syntax.
>>>>>    -
>>>>> For the clauses that are supported in one but not the other, we can
>>>>> add them to a unified rule as optional clauses. The AST builder would
>>>>> validate what makes sense or not (e.g., stored as with using or row format
>>>>> delimited) and finally pass the remaining data on using the
>>>>> CreateTableStatement. That statement would be handled like we do for
>>>>> the Spark rule today, but with extra metadata to pass along. This is
also a
>>>>> step toward being able to put Hive behind the DSv2 API because we’d
be able
>>>>> to pass all of the Hive metadata clauses to the v2 catalog.
>>>>> The only difficult part is handling PARTITIONED BY. But in that case,
>>>>> we can use two different syntaxes from the same CREATE TABLE rule. If
>>>>> types are included, we use the Hive PARTITIONED BY syntax and convert
>>>>> in the AST builder to normalize to a single representation.
>>>> The proposal looks promising - it may add some sort of complexity but
>>> sounds like worth adding.
>>> One thing to make clear - in unified syntax we only rely on explicit
>>> provider, or default provider, right? I would concern if the proposal
>>> automatically uses Hive provider if Hive specific clauses are being used.
>>> Yes as I said earlier it may make end users' query to be changed, but
>>> better than uncertain.
>>> Btw, if the main purpose to add native syntax and change it by default
>>> is to discontinue supporting Hive create table rule sooner, simply dropping
>>> rule 2 with providing legacy config is still one of valid options I think.
>>>> What do you both think? This would make the behavior more clear and
>>>>> take a step toward getting rid of Hive-specific code.
>>>>> On Wed, Mar 18, 2020 at 4:45 PM Jungtaek Lim <
>>>>>> wrote:
>>>>>> I'm trying to understand the reason you have been suggesting to keep
>>>>>> the real thing unchanged but change doc instead. Could you please
>>>>>> why? End users would blame us when they hit the case their query
>>>>>> work as intended (1) and found the fact it's undocumented (2) and
hard to
>>>>>> understand even from the Spark codebase (3).
>>>>>> For me, addressing the root issue adopting your suggestion would
>>>>>> "dropping the rule 2" and only supporting it with legacy config on.
>>>>>> would say to end users, you need to enable the legacy config to leverage
>>>>>> Hive create table syntax, or just use beeline with Hive connected.
>>>>>> But since we are even thinking about native syntax as a first class
>>>>>> and dropping Hive one implicitly (hide in doc) or explicitly, does
>>>>>> really matter we require a marker (like "HIVE") in rule 2 and isolate
>>>>>> It would have even less confusion than Spark 2.x, since we will require
>>>>>> users to fill the marker regarding Hive when creating Hive table,
easier to
>>>>>> classify than "USING provider".
>>>>>> If we think native syntax would cover many cases end users have been
>>>>>> creating Hive table in Spark (say, USING hive would simply work for
>>>>>> I'm OK to drop the rule 2 and lead end users to enable the legacy
config if
>>>>>> really needed. If not, let's continue "fixing" the issue.
>>>>>> (Another valid approach would be consolidating two rules into one,
>>>>>> and defining support of parameters per provider, e.g. EXTERNAL, STORED
>>>>>> ROW FORMAT, etc. are only supported in Hive provider.)
>>>>>> On Wed, Mar 18, 2020 at 8:47 PM Wenchen Fan <>
>>>>>> wrote:
>>>>>>> The fact that we have 2 CREATE TABLE syntax is already confusing
>>>>>>> many users. Shall we only document the native syntax? Then users
don't need
>>>>>>> to worry about which rule their query fits and they don't need
to spend a
>>>>>>> lot of time understanding the subtle difference between these
2 syntaxes.
>>>>>>> On Wed, Mar 18, 2020 at 7:01 PM Jungtaek Lim <
>>>>>>>> wrote:
>>>>>>>> A bit correction: the example I provided for vice versa is
>>>>>>>> really a correct case for vice versa. It's actually same
case (intended to
>>>>>>>> use rule 2 which is not default) but different result.
>>>>>>>> On Wed, Mar 18, 2020 at 7:22 PM Jungtaek Lim <
>>>>>>>>> wrote:
>>>>>>>>> My concern is that although we simply think about the
change to
>>>>>>>>> mark "USING provider" to be optional in rule 1, but in
reality the change
>>>>>>>>> is most likely swapping the default rule for CREATE TABLE,
which was "rule
>>>>>>>>> 2", and now "rule 1" (it would be the happy case of migration
doc if the
>>>>>>>>> swap happens as intended), and there're still couple
of things which make
>>>>>>>>> the query still fall into rule 2 which is non-trivial
to reason about and
>>>>>>>>> also not easy to explain.
>>>>>>>>> I only mentioned ROW FORMAT and STORED AS as the case
to fall into
>>>>>>>>> rule 2 to simplify the problem statement, but they're
not only the case
>>>>>>>>> - using col_name1:col_type1 would make the query fall
into rule 2
>>>>>>>>> regardless of any other properties.
>>>>>>>>> What's the matter? In Spark 2.x, if end users want to
use rule 1
>>>>>>>>> (which is not default) and specify the parameters which
are only available
>>>>>>>>> in rule 1, it clearly requires "USING provider" - parser
will throw error
>>>>>>>>> if there're any mistakes on specifying parameters. End
users could set
>>>>>>>>> intention clearly which rule the query should be bound.
If the query fails
>>>>>>>>> to bind the rule as intended, it's simply denied.
>>>>>>>>> In Spark 3.x, parser may not help figuring out the case
where end
>>>>>>>>> users intend to use rule 2 (which is not default) but
some mistake on
>>>>>>>>> specifying parameters - it could just "silently" bound
into rule 1 and it
>>>>>>>>> may be even executed without any error. Vice versa happens,
but in odd way
>>>>>>>>> - e.g. CREATE EXTERNAL TABLE ... LOCATION fails with
weird message CREATE
>>>>>>>>> EXTERNAL TABLE is not supported.
>>>>>>>>> It's deterministic for end users only if they're fully
>>>>>>>>> the difference between the twos and also understand how
Spark would apply
>>>>>>>>> the rules to make the query fall into one of the rule.
I'm not sure how we
>>>>>>>>> can only improve documentation to make things be clear,
but if the approach
>>>>>>>>> would be explaining the difference of rules and guide
the tip to make the
>>>>>>>>> query be bound to the specific rule, the same could be
applied to parser
>>>>>>>>> rule to address the root cause.
>>>>>>>>> On Wed, Mar 18, 2020 at 6:24 PM Wenchen Fan <>
>>>>>>>>> wrote:
>>>>>>>>>> Document-wise, yes, it's confusing as a simple CREATE
TABLE fits
>>>>>>>>>> both native and Hive syntax. I'm fine with some changes
to make it less
>>>>>>>>>> confusing, as long as the user-facing behavior doesn't
change. For example,
>>>>>>>>>> define "ROW FORMAT" or "STORED AS" as mandatory only
if the legacy config
>>>>>>>>>> is false.
>>>>>>>>>> I still don't get your point about what's the real
problem to end
>>>>>>>>>> users. There is no ambiguity as the behavior is deterministic,
although we
>>>>>>>>>> rely on optional fields and rule order which is bad.
It's hard to document,
>>>>>>>>>> but I don't think that's a big problem to end users.
>>>>>>>>>> For the legacy config, it does make the implementation
>>>>>>>>>> complicated, but it's invisible to most end users
(we don't document it)
>>>>>>>>>> and can be super useful to some users who want the
queries to keep working
>>>>>>>>>> in 3.0 without rewriting.
>>>>>>>>>> If your only concern is documentation, I totally
agree that we
>>>>>>>>>> should improve it.
>>>>>>>>>> On Wed, Mar 18, 2020 at 4:36 PM Jungtaek Lim <
>>>>>>>>>>> wrote:
>>>>>>>>>>> Thanks for sharing your view.
>>>>>>>>>>> I agree with you it's good for Spark to promote
Spark's own
>>>>>>>>>>> CREATE TABLE syntax. The thing is, we still leave
Hive CREATE TABLE syntax
>>>>>>>>>>> unchanged - it's being said as "convenience"
but I'm not sure I can agree
>>>>>>>>>>> with.
>>>>>>>>>>> I'll quote my comments in SPARK-31136 here again
to make the
>>>>>>>>>>> problem statement be clearer:
>>>>>>>>>>> I think the parser implementation around CREATE
TABLE brings
>>>>>>>>>>> ambiguity which is not documented anywhere. It
wasn't ambiguous because we
>>>>>>>>>>> forced to specify USE provider if it's not a
Hive table. Now it's either
>>>>>>>>>>> default provider or Hive according to which options
are provided, which
>>>>>>>>>>> seems to be non-trivial to reason about. (End
users would never know, as
>>>>>>>>>>> it's completely from parser rule.)
>>>>>>>>>>> I feel this as the issue of "not breaking old
behavior". The
>>>>>>>>>>> parser rule gets pretty much complicated due
to support legacy config. Not
>>>>>>>>>>> breaking anything would make us be stuck eventually.
>>>>>>>>>>> Since Spark 3.0, CREATE TABLE without a specific
provider will
>>>>>>>>>>> use the value of spark.sql.sources.default as
its provider. In Spark
>>>>>>>>>>> version 2.4 and earlier, it was hive. To restore
the behavior before Spark
>>>>>>>>>>> 3.0, you can set spark.sql.legacy.createHiveTableByDefault.enabled
to true.
>>>>>>>>>>> It's not true if "ROW FORMAT" / "STORED AS" are
provided, and we
>>>>>>>>>>> didn't describe anything for this.
>>>>>>>>>>> CREATE TABLE [ IF NOT EXISTS ] table_identifier
[ ( col_name1
>>>>>>>>>>> col_type1 [ COMMENT col_comment1 ], ... ) ] [USING
data_source] [ OPTIONS (
>>>>>>>>>>> key1=val1, key2=val2, ... ) ] [ PARTITIONED BY
( col_name1, col_name2, ...
>>>>>>>>>>> ) ] [ CLUSTERED BY ( col_name3, col_name4, ...
) [ SORTED BY ( col_name [
>>>>>>>>>>> ASC | DESC ], ... ) ] INTO num_buckets BUCKETS
] [ LOCATION path ] [
>>>>>>>>>>> COMMENT table_comment ] [ TBLPROPERTIES ( key1=val1,
key2=val2, ... ) ] [
>>>>>>>>>>> AS select_statement ]
>>>>>>>>>>> CREATE [ EXTERNAL ] TABLE [ IF NOT EXISTS ] table_identifier
[ (
>>>>>>>>>>> col_name1[:] col_type1 [ COMMENT col_comment1
], ... ) ] [ COMMENT
>>>>>>>>>>> table_comment ] [ PARTITIONED BY ( col_name2[:]
col_type2 [ COMMENT
>>>>>>>>>>> col_comment2 ], ... ) | ( col_name1, col_name2,
... ) ] [ ROW FORMAT
>>>>>>>>>>> row_format ] [ STORED AS file_format ] [ LOCATION
>>>>>>>>>>> key1=val1, key2=val2, ... ) ] [ AS select_statement
>>>>>>>>>>> At least we should describe that parser will
try to match the
>>>>>>>>>>> first case (create table ~ using data source),
and fail back to second
>>>>>>>>>>> case; even though we describe this it's not intuitive
to reason about which
>>>>>>>>>>> rule the DDL query will fall into. As I commented
earlier, "ROW FORMAT" and
>>>>>>>>>>> "STORED AS" are the options which make DDL query
fall into the second case,
>>>>>>>>>>> but they're described as "optional" so it's hard
to catch the gotcha.
>>>>>>>>>>> Furthermore, while we document the syntax as
above, in reality
>>>>>>>>>>> we allow "EXTERNAL" in first rule (and throw
error), which ends up existing
be broken. Even we add "USING
>>>>>>>>>>> hive" parser will throw error. It now requires
>>>>>>>>>>> Simply saying, do we really think end users should
stop and try
>>>>>>>>>>> to match their query against the parser rules
(or orders when we explain in
>>>>>>>>>>> the doc) by themselves to understand which provider
the table will
>>>>>>>>>>> leverage? I'm sorry but I think we are making
bad assumption on end users
>>>>>>>>>>> which is a serious problem.
>>>>>>>>>>> If we really want to promote Spark's one for
>>>>>>>>>>> would it really matter to treat Hive CREATE TABLE
be "exceptional" one and
>>>>>>>>>>> try to isolate each other? What's the point of
providing a legacy config to
>>>>>>>>>>> go back to the old one even we fear about breaking
something to make it
>>>>>>>>>>> better or clearer? We do think that table provider
is important (hence the
>>>>>>>>>>> change was done), then is it still a trivial
problem whether the provider
>>>>>>>>>>> is affected by specifying the "optional" fields?
>>>>>>>>>>> On Wed, Mar 18, 2020 at 4:38 PM Wenchen Fan <>
>>>>>>>>>>> wrote:
>>>>>>>>>>>> I think the general guideline is to promote
Spark's own CREATE
>>>>>>>>>>>> TABLE syntax instead of the Hive one. Previously
these two rules are
>>>>>>>>>>>> mutually exclusive because the native syntax
requires the USING clause
>>>>>>>>>>>> while the Hive syntax makes ROW FORMAT or
STORED AS clause optional.
>>>>>>>>>>>> It's a good move to make the USING clause
optional, which makes
>>>>>>>>>>>> it easier to write the native CREATE TABLE
syntax. Unfortunately, it leads
>>>>>>>>>>>> to some conflicts with the Hive CREATE TABLE
syntax, but I don't see a
>>>>>>>>>>>> serious problem here. If a user just writes
>>>>>>>>>>>> or ROW FORMAT or STORED AS, does it matter
what table we create? Internally
>>>>>>>>>>>> the parser rules conflict and we pick the
native syntax depending on the
>>>>>>>>>>>> rule order. But the user-facing behavior
looks fine.
>>>>>>>>>>>> CREATE EXTERNAL TABLE is a problem as it
works in 2.4 but not
>>>>>>>>>>>> in 3.0. Shall we simply remove EXTERNAL from
the native CREATE TABLE
>>>>>>>>>>>> syntax? Then CREATE EXTERNAL TABLE creates
Hive table like 2.4.
>>>>>>>>>>>> On Mon, Mar 16, 2020 at 10:55 AM Jungtaek
Lim <
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>> Hi devs,
>>>>>>>>>>>>> I'd like to initiate discussion and hear
the voices for
>>>>>>>>>>>>> resolving ambiguous parser rule between
two "create table"s being brought
>>>>>>>>>>>>> by SPARK-30098 [1].
>>>>>>>>>>>>> Previously, "create table" parser rules
were clearly
>>>>>>>>>>>>> distinguished via "USING provider", which
was very intuitive and
>>>>>>>>>>>>> deterministic. Say, DDL query creates
"Hive" table unless "USING provider"
>>>>>>>>>>>>> is specified,
>>>>>>>>>>>>> (Please refer the parser rule in branch-2.4
>>>>>>>>>>>>> After SPARK-30098, "create table" parser
rules became
>>>>>>>>>>>>> ambiguous (please refer the parser rule
in branch-3.0 [3]) - the factors
>>>>>>>>>>>>> differentiating two rules are only "ROW
FORMAT" and "STORED AS" which are
>>>>>>>>>>>>> all defined as "optional". Now it relies
on the "order" of parser rule
>>>>>>>>>>>>> which end users would have no idea to
reason about, and very unintuitive.
>>>>>>>>>>>>> Furthermore, undocumented rule of EXTERNAL
(added in the first
>>>>>>>>>>>>> rule to provide better message) brought
more confusion (I've described the
>>>>>>>>>>>>> broken existing query via SPARK-30436
>>>>>>>>>>>>> Personally I'd like to see two rules
mutually exclusive,
>>>>>>>>>>>>> instead of trying to document the difference
and talk end users to be
>>>>>>>>>>>>> careful about their query. I'm seeing
two ways to make rules be mutually
>>>>>>>>>>>>> exclusive:
>>>>>>>>>>>>> 1. Add some identifier in create Hive
table rule, like `CREATE
>>>>>>>>>>>>> ... "HIVE" TABLE ...`.
>>>>>>>>>>>>> pros. This is the simplest way to distinguish
between two
>>>>>>>>>>>>> rules.
>>>>>>>>>>>>> cons. This would lead end users to change
their query if they
>>>>>>>>>>>>> intend to create Hive table. (Given we
will also provide legacy option I'm
>>>>>>>>>>>>> feeling this is acceptable.)
>>>>>>>>>>>>> 2. Define "ROW FORMAT" or "STORED AS"
as mandatory one.
>>>>>>>>>>>>> pros. Less invasive for existing queries.
>>>>>>>>>>>>> cons. Less intuitive, because they have
been optional and now
>>>>>>>>>>>>> become mandatory to fall into the second
>>>>>>>>>>>>> Would like to hear everyone's voices;
better ideas are welcome!
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Jungtaek Lim (HeartSaVioR)
>>>>>>>>>>>>> 1. SPARK-30098 Use default datasource
as provider for CREATE
>>>>>>>>>>>>> TABLE syntax
>>>>>>>>>>>>> 2.
>>>>>>>>>>>>> 3.
>>>>>>>>>>>>> 4.
>>>>> --
>>>>> Ryan Blue
>>>>> Software Engineer
>>>>> Netflix
> --
> Ryan Blue
> Software Engineer
> Netflix

View raw message