spark-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Ryan Blue <>
Subject Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s
Date Fri, 20 Mar 2020 00:36:00 GMT
I have an update to the parser that unifies the CREATE TABLE rules. It took
surprisingly little work to get the parser updated to produce
CreateTableStatement and CreateTableAsSelectStatement with the Hive info.
And the only fields I need to add to those statements were serde: SerdeInfo
and external: Boolean.

>From there, we can use the conversion rules to re-create the same Hive
command for v1 or pass the data as properties for v2. I’ll work on getting
this cleaned up and open a PR hopefully tomorrow.

For the questions about how this gets converted to either a Spark or Hive
create table command, that is really up to analyzer rules and
configuration. With my changes, it is no longer determined by the parser:
the parser just produces a node that includes all of the user options and
Spark decides what to do with that in the analyzer. Also, there's already
an option to convert Hive syntax to a Spark
command, spark.sql.hive.convertCTAS.


On Thu, Mar 19, 2020 at 12:46 AM Wenchen Fan <> wrote:

> Big +1 to have one single unified CREATE TABLE syntax.
> In general, we can say there are 2 ways to specify the table provider:
> USING clause and ROW FORMAT/STORED AS clause. These 2 ways are mutually
> exclusive. If none is specified, it implicitly indicates USING
> defaultSource.
> I'm fine with a few special cases which can indicate the table provider,
> like EXTERNAL indicates Hive Serde table. A few thoughts:
> 1. SKEWED BY ...: We support it in Hive syntax just to fail it with a
> nice error message. We can support it in the unified syntax as well, and
> fail it.
> 2. PARTITIONED BY colTypeList: I think we can support it in the unified
> syntax. Just make sure it doesn't appear together with PARTITIONED BY
> transformList.
> 3. OPTIONS: We can either map it to Hive Serde properties, or let it
> indicate non-Hive tables.
> On Thu, Mar 19, 2020 at 1:00 PM Jungtaek Lim <>
> wrote:
>> Thanks Nicholas for the side comment; you'll need to interpret "CREATE
>> may add the confusion.
>> Ryan, thanks for the detailed analysis and proposal. That's what I would
>> like to see in discussion thread.
>> I'm open to solutions which enable end users to specify their intention
>> properly - my main concern of SPARK-30098 is that it becomes unclear which
>> provider the query will use in create table unless USING provider is
>> explicitly specified. If the new proposal makes clear on this, that should
>> be better than now.
>> Replying inline:
>> On Thu, Mar 19, 2020 at 11:06 AM Nicholas Chammas <
>>> wrote:
>>> Side comment: The current docs for CREATE TABLE
>>> <>
>>> add to the confusion by describing the Hive-compatible command as "CREATE
>>> TABLE USING HIVE FORMAT", but neither "USING" nor "HIVE FORMAT" are
>>> actually part of the syntax
>>> <>
>>> .
>>> On Wed, Mar 18, 2020 at 8:31 PM Ryan Blue <>
>>> wrote:
>>>> Jungtaek, it sounds like you consider the two rules to be separate
>>>> syntaxes with their own consistency rules. For example, if I am using the
>>>> Hive syntax rule, then the PARTITIONED BY clause adds new (partition)
>>>> columns and requires types for those columns; if I’m using the Spark syntax
>>>> rule with USING then PARTITIONED BY must reference existing columns
>>>> and cannot include types.
>>>> I agree that this is confusing to users! We should fix it, but I don’t
>>>> think the right solution is to continue to have two rules with divergent
>>>> syntax.
>>>> This is confusing to users because they don’t know anything about
>>>> separate parser rules. All the user sees is that sometimes PARTITION BY
>>>> requires types and sometimes it doesn’t. Yes, we could add a keyword,
>>>> HIVE, to signal that the syntax is borrowed from Hive for that case,
>>>> but that actually breaks queries that run in Hive.
>>> That might less matter, because SPARK-30098 (and I guess your proposal
>> as well) enforces end users to add "USING HIVE" for their queries to enable
>> Hive provider in any way, even only when the query matches with rule 1
>> (conditional). Once they decide to create Hive table, the query might have
>> to be changed, or they have to change the default provider, or they have to
>> enable legacy config.
>>> I think the right solution is to unify the two syntaxes. I don’t think
>>>> they are so different that it isn’t possible. Here are the differences
>>>> see:
>>>>    - Only in Hive:
>>>>       - EXTERNAL
>>>>       - skewSpec: SKEWED BY ...
>>>>       - rowFormat: ROW FORMAT DELIMITED ..., ROW FORMAT SERDE ...
>>>>       - createFileFormat: STORED AS ...
>>>>    - Only in Spark:
>>>>       - OPTIONS property list
>>>>    - Different syntax/interpretation:
>>>>       - PARTITIONED BY transformList / PARTITIONED BY colTypeList
>>>> ":" after column name is another one only supported in Hive, though
>> that's relatively minor to support it in unified syntax.
>>>>    -
>>>> For the clauses that are supported in one but not the other, we can add
>>>> them to a unified rule as optional clauses. The AST builder would then
>>>> validate what makes sense or not (e.g., stored as with using or row format
>>>> delimited) and finally pass the remaining data on using the
>>>> CreateTableStatement. That statement would be handled like we do for
>>>> the Spark rule today, but with extra metadata to pass along. This is also
>>>> step toward being able to put Hive behind the DSv2 API because we’d be
>>>> to pass all of the Hive metadata clauses to the v2 catalog.
>>>> The only difficult part is handling PARTITIONED BY. But in that case,
>>>> we can use two different syntaxes from the same CREATE TABLE rule. If
>>>> types are included, we use the Hive PARTITIONED BY syntax and convert
>>>> in the AST builder to normalize to a single representation.
>>> The proposal looks promising - it may add some sort of complexity but
>> sounds like worth adding.
>> One thing to make clear - in unified syntax we only rely on explicit
>> provider, or default provider, right? I would concern if the proposal
>> automatically uses Hive provider if Hive specific clauses are being used.
>> Yes as I said earlier it may make end users' query to be changed, but
>> better than uncertain.
>> Btw, if the main purpose to add native syntax and change it by default is
>> to discontinue supporting Hive create table rule sooner, simply dropping
>> rule 2 with providing legacy config is still one of valid options I think.
>>> What do you both think? This would make the behavior more clear and take
>>>> a step toward getting rid of Hive-specific code.
>>>> On Wed, Mar 18, 2020 at 4:45 PM Jungtaek Lim <
>>>>> wrote:
>>>>> I'm trying to understand the reason you have been suggesting to keep
>>>>> the real thing unchanged but change doc instead. Could you please elaborate
>>>>> why? End users would blame us when they hit the case their query doesn't
>>>>> work as intended (1) and found the fact it's undocumented (2) and hard
>>>>> understand even from the Spark codebase (3).
>>>>> For me, addressing the root issue adopting your suggestion would be
>>>>> "dropping the rule 2" and only supporting it with legacy config on. We
>>>>> would say to end users, you need to enable the legacy config to leverage
>>>>> Hive create table syntax, or just use beeline with Hive connected.
>>>>> But since we are even thinking about native syntax as a first class
>>>>> and dropping Hive one implicitly (hide in doc) or explicitly, does it
>>>>> really matter we require a marker (like "HIVE") in rule 2 and isolate
>>>>> It would have even less confusion than Spark 2.x, since we will require
>>>>> users to fill the marker regarding Hive when creating Hive table, easier
>>>>> classify than "USING provider".
>>>>> If we think native syntax would cover many cases end users have been
>>>>> creating Hive table in Spark (say, USING hive would simply work for them),
>>>>> I'm OK to drop the rule 2 and lead end users to enable the legacy config
>>>>> really needed. If not, let's continue "fixing" the issue.
>>>>> (Another valid approach would be consolidating two rules into one, and
>>>>> defining support of parameters per provider, e.g. EXTERNAL, STORED AS,
>>>>> FORMAT, etc. are only supported in Hive provider.)
>>>>> On Wed, Mar 18, 2020 at 8:47 PM Wenchen Fan <>
>>>>> wrote:
>>>>>> The fact that we have 2 CREATE TABLE syntax is already confusing
>>>>>> users. Shall we only document the native syntax? Then users don't
need to
>>>>>> worry about which rule their query fits and they don't need to spend
a lot
>>>>>> of time understanding the subtle difference between these 2 syntaxes.
>>>>>> On Wed, Mar 18, 2020 at 7:01 PM Jungtaek Lim <
>>>>>>> wrote:
>>>>>>> A bit correction: the example I provided for vice versa is not
>>>>>>> really a correct case for vice versa. It's actually same case
(intended to
>>>>>>> use rule 2 which is not default) but different result.
>>>>>>> On Wed, Mar 18, 2020 at 7:22 PM Jungtaek Lim <
>>>>>>>> wrote:
>>>>>>>> My concern is that although we simply think about the change
>>>>>>>> mark "USING provider" to be optional in rule 1, but in reality
the change
>>>>>>>> is most likely swapping the default rule for CREATE TABLE,
which was "rule
>>>>>>>> 2", and now "rule 1" (it would be the happy case of migration
doc if the
>>>>>>>> swap happens as intended), and there're still couple of things
which make
>>>>>>>> the query still fall into rule 2 which is non-trivial to
reason about and
>>>>>>>> also not easy to explain.
>>>>>>>> I only mentioned ROW FORMAT and STORED AS as the case to
fall into
>>>>>>>> rule 2 to simplify the problem statement, but they're not
only the case
>>>>>>>> - using col_name1:col_type1 would make the query fall into
rule 2
>>>>>>>> regardless of any other properties.
>>>>>>>> What's the matter? In Spark 2.x, if end users want to use
rule 1
>>>>>>>> (which is not default) and specify the parameters which are
only available
>>>>>>>> in rule 1, it clearly requires "USING provider" - parser
will throw error
>>>>>>>> if there're any mistakes on specifying parameters. End users
could set
>>>>>>>> intention clearly which rule the query should be bound. If
the query fails
>>>>>>>> to bind the rule as intended, it's simply denied.
>>>>>>>> In Spark 3.x, parser may not help figuring out the case where
>>>>>>>> users intend to use rule 2 (which is not default) but some
mistake on
>>>>>>>> specifying parameters - it could just "silently" bound into
rule 1 and it
>>>>>>>> may be even executed without any error. Vice versa happens,
but in odd way
>>>>>>>> - e.g. CREATE EXTERNAL TABLE ... LOCATION fails with weird
message CREATE
>>>>>>>> EXTERNAL TABLE is not supported.
>>>>>>>> It's deterministic for end users only if they're fully understand
>>>>>>>> the difference between the twos and also understand how Spark
would apply
>>>>>>>> the rules to make the query fall into one of the rule. I'm
not sure how we
>>>>>>>> can only improve documentation to make things be clear, but
if the approach
>>>>>>>> would be explaining the difference of rules and guide the
tip to make the
>>>>>>>> query be bound to the specific rule, the same could be applied
to parser
>>>>>>>> rule to address the root cause.
>>>>>>>> On Wed, Mar 18, 2020 at 6:24 PM Wenchen Fan <>
>>>>>>>> wrote:
>>>>>>>>> Document-wise, yes, it's confusing as a simple CREATE
TABLE fits
>>>>>>>>> both native and Hive syntax. I'm fine with some changes
to make it less
>>>>>>>>> confusing, as long as the user-facing behavior doesn't
change. For example,
>>>>>>>>> define "ROW FORMAT" or "STORED AS" as mandatory only
if the legacy config
>>>>>>>>> is false.
>>>>>>>>> I still don't get your point about what's the real problem
to end
>>>>>>>>> users. There is no ambiguity as the behavior is deterministic,
although we
>>>>>>>>> rely on optional fields and rule order which is bad.
It's hard to document,
>>>>>>>>> but I don't think that's a big problem to end users.
>>>>>>>>> For the legacy config, it does make the implementation
>>>>>>>>> complicated, but it's invisible to most end users (we
don't document it)
>>>>>>>>> and can be super useful to some users who want the queries
to keep working
>>>>>>>>> in 3.0 without rewriting.
>>>>>>>>> If your only concern is documentation, I totally agree
that we
>>>>>>>>> should improve it.
>>>>>>>>> On Wed, Mar 18, 2020 at 4:36 PM Jungtaek Lim <
>>>>>>>>>> wrote:
>>>>>>>>>> Thanks for sharing your view.
>>>>>>>>>> I agree with you it's good for Spark to promote Spark's
>>>>>>>>>> CREATE TABLE syntax. The thing is, we still leave
Hive CREATE TABLE syntax
>>>>>>>>>> unchanged - it's being said as "convenience" but
I'm not sure I can agree
>>>>>>>>>> with.
>>>>>>>>>> I'll quote my comments in SPARK-31136 here again
to make the
>>>>>>>>>> problem statement be clearer:
>>>>>>>>>> I think the parser implementation around CREATE TABLE
>>>>>>>>>> ambiguity which is not documented anywhere. It wasn't
ambiguous because we
>>>>>>>>>> forced to specify USE provider if it's not a Hive
table. Now it's either
>>>>>>>>>> default provider or Hive according to which options
are provided, which
>>>>>>>>>> seems to be non-trivial to reason about. (End users
would never know, as
>>>>>>>>>> it's completely from parser rule.)
>>>>>>>>>> I feel this as the issue of "not breaking old behavior".
>>>>>>>>>> parser rule gets pretty much complicated due to support
legacy config. Not
>>>>>>>>>> breaking anything would make us be stuck eventually.
>>>>>>>>>> Since Spark 3.0, CREATE TABLE without a specific
provider will
>>>>>>>>>> use the value of spark.sql.sources.default as its
provider. In Spark
>>>>>>>>>> version 2.4 and earlier, it was hive. To restore
the behavior before Spark
>>>>>>>>>> 3.0, you can set spark.sql.legacy.createHiveTableByDefault.enabled
to true.
>>>>>>>>>> It's not true if "ROW FORMAT" / "STORED AS" are provided,
and we
>>>>>>>>>> didn't describe anything for this.
>>>>>>>>>> CREATE TABLE [ IF NOT EXISTS ] table_identifier [
( col_name1
>>>>>>>>>> col_type1 [ COMMENT col_comment1 ], ... ) ] [USING
data_source] [ OPTIONS (
>>>>>>>>>> key1=val1, key2=val2, ... ) ] [ PARTITIONED BY (
col_name1, col_name2, ...
>>>>>>>>>> ) ] [ CLUSTERED BY ( col_name3, col_name4, ... )
[ SORTED BY ( col_name [
>>>>>>>>>> ASC | DESC ], ... ) ] INTO num_buckets BUCKETS ]
[ LOCATION path ] [
>>>>>>>>>> COMMENT table_comment ] [ TBLPROPERTIES ( key1=val1,
key2=val2, ... ) ] [
>>>>>>>>>> AS select_statement ]
>>>>>>>>>> CREATE [ EXTERNAL ] TABLE [ IF NOT EXISTS ] table_identifier
[ (
>>>>>>>>>> col_name1[:] col_type1 [ COMMENT col_comment1 ],
... ) ] [ COMMENT
>>>>>>>>>> table_comment ] [ PARTITIONED BY ( col_name2[:] col_type2
>>>>>>>>>> col_comment2 ], ... ) | ( col_name1, col_name2, ...
>>>>>>>>>> row_format ] [ STORED AS file_format ] [ LOCATION
>>>>>>>>>> key1=val1, key2=val2, ... ) ] [ AS select_statement
>>>>>>>>>> At least we should describe that parser will try
to match the
>>>>>>>>>> first case (create table ~ using data source), and
fail back to second
>>>>>>>>>> case; even though we describe this it's not intuitive
to reason about which
>>>>>>>>>> rule the DDL query will fall into. As I commented
earlier, "ROW FORMAT" and
>>>>>>>>>> "STORED AS" are the options which make DDL query
fall into the second case,
>>>>>>>>>> but they're described as "optional" so it's hard
to catch the gotcha.
>>>>>>>>>> Furthermore, while we document the syntax as above,
in reality we
>>>>>>>>>> allow "EXTERNAL" in first rule (and throw error),
which ends up existing
>>>>>>>>>> DDL query "CREATE EXTERNAL TABLE ~ LOCATION" be broken.
Even we add "USING
>>>>>>>>>> hive" parser will throw error. It now requires "ROW
>>>>>>>>>> Simply saying, do we really think end users should
stop and try
>>>>>>>>>> to match their query against the parser rules (or
orders when we explain in
>>>>>>>>>> the doc) by themselves to understand which provider
the table will
>>>>>>>>>> leverage? I'm sorry but I think we are making bad
assumption on end users
>>>>>>>>>> which is a serious problem.
>>>>>>>>>> If we really want to promote Spark's one for CREATE
TABLE, then
>>>>>>>>>> would it really matter to treat Hive CREATE TABLE
be "exceptional" one and
>>>>>>>>>> try to isolate each other? What's the point of providing
a legacy config to
>>>>>>>>>> go back to the old one even we fear about breaking
something to make it
>>>>>>>>>> better or clearer? We do think that table provider
is important (hence the
>>>>>>>>>> change was done), then is it still a trivial problem
whether the provider
>>>>>>>>>> is affected by specifying the "optional" fields?
>>>>>>>>>> On Wed, Mar 18, 2020 at 4:38 PM Wenchen Fan <>
>>>>>>>>>> wrote:
>>>>>>>>>>> I think the general guideline is to promote Spark's
>>>>>>>>>>> TABLE syntax instead of the Hive one. Previously
these two rules are
>>>>>>>>>>> mutually exclusive because the native syntax
requires the USING clause
>>>>>>>>>>> while the Hive syntax makes ROW FORMAT or STORED
AS clause optional.
>>>>>>>>>>> It's a good move to make the USING clause optional,
which makes
>>>>>>>>>>> it easier to write the native CREATE TABLE syntax.
Unfortunately, it leads
>>>>>>>>>>> to some conflicts with the Hive CREATE TABLE
syntax, but I don't see a
>>>>>>>>>>> serious problem here. If a user just writes CREATE
>>>>>>>>>>> or ROW FORMAT or STORED AS, does it matter what
table we create? Internally
>>>>>>>>>>> the parser rules conflict and we pick the native
syntax depending on the
>>>>>>>>>>> rule order. But the user-facing behavior looks
>>>>>>>>>>> CREATE EXTERNAL TABLE is a problem as it works
in 2.4 but not in
>>>>>>>>>>> 3.0. Shall we simply remove EXTERNAL from the
native CREATE TABLE syntax?
>>>>>>>>>>> Then CREATE EXTERNAL TABLE creates Hive table
like 2.4.
>>>>>>>>>>> On Mon, Mar 16, 2020 at 10:55 AM Jungtaek Lim
>>>>>>>>>>>> wrote:
>>>>>>>>>>>> Hi devs,
>>>>>>>>>>>> I'd like to initiate discussion and hear
the voices for
>>>>>>>>>>>> resolving ambiguous parser rule between two
"create table"s being brought
>>>>>>>>>>>> by SPARK-30098 [1].
>>>>>>>>>>>> Previously, "create table" parser rules were
>>>>>>>>>>>> distinguished via "USING provider", which
was very intuitive and
>>>>>>>>>>>> deterministic. Say, DDL query creates "Hive"
table unless "USING provider"
>>>>>>>>>>>> is specified,
>>>>>>>>>>>> (Please refer the parser rule in branch-2.4
>>>>>>>>>>>> After SPARK-30098, "create table" parser
rules became ambiguous
>>>>>>>>>>>> (please refer the parser rule in branch-3.0
[3]) - the factors
>>>>>>>>>>>> differentiating two rules are only "ROW FORMAT"
and "STORED AS" which are
>>>>>>>>>>>> all defined as "optional". Now it relies
on the "order" of parser rule
>>>>>>>>>>>> which end users would have no idea to reason
about, and very unintuitive.
>>>>>>>>>>>> Furthermore, undocumented rule of EXTERNAL
(added in the first
>>>>>>>>>>>> rule to provide better message) brought more
confusion (I've described the
>>>>>>>>>>>> broken existing query via SPARK-30436 [4]).
>>>>>>>>>>>> Personally I'd like to see two rules mutually
>>>>>>>>>>>> instead of trying to document the difference
and talk end users to be
>>>>>>>>>>>> careful about their query. I'm seeing two
ways to make rules be mutually
>>>>>>>>>>>> exclusive:
>>>>>>>>>>>> 1. Add some identifier in create Hive table
rule, like `CREATE
>>>>>>>>>>>> ... "HIVE" TABLE ...`.
>>>>>>>>>>>> pros. This is the simplest way to distinguish
between two rules.
>>>>>>>>>>>> cons. This would lead end users to change
their query if they
>>>>>>>>>>>> intend to create Hive table. (Given we will
also provide legacy option I'm
>>>>>>>>>>>> feeling this is acceptable.)
>>>>>>>>>>>> 2. Define "ROW FORMAT" or "STORED AS" as
mandatory one.
>>>>>>>>>>>> pros. Less invasive for existing queries.
>>>>>>>>>>>> cons. Less intuitive, because they have been
optional and now
>>>>>>>>>>>> become mandatory to fall into the second
>>>>>>>>>>>> Would like to hear everyone's voices; better
ideas are welcome!
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Jungtaek Lim (HeartSaVioR)
>>>>>>>>>>>> 1. SPARK-30098 Use default datasource as
provider for CREATE
>>>>>>>>>>>> TABLE syntax
>>>>>>>>>>>> 2.
>>>>>>>>>>>> 3.
>>>>>>>>>>>> 4.
>>>> --
>>>> Ryan Blue
>>>> Software Engineer
>>>> Netflix

Ryan Blue
Software Engineer

View raw message