spark-dev mailing list archives

From Wenchen Fan <>
Subject Re: [DISCUSS] Syntax for table DDL
Date Thu, 04 Oct 2018 05:07:26 GMT
Thank you Ryan for proposing the DDL syntax! I think it's good to follow
mainstream databases, and the proposed syntax looks very reasonable.

About Hive compatibility, I think it's not that important now, but it's
still good if we can keep it. Shall we support the Hive syntax as an
alternative? That seems not very hard: just a few more ANTLR rules. It
would be even better if we could make the ANTLR grammar extensible and
allow other data sources to define custom SQL syntax.

Anyway, I think the two are orthogonal. We can go ahead with the proposed
syntax here and add Hive-compatible syntax later.

On Tue, Oct 2, 2018 at 11:50 PM Ryan Blue <> wrote:

> I'd say that it was important to be compatible with Hive in the past, but
> that's becoming less important over time. Spark is well established with
> Hadoop users and I think the focus moving forward should be to make Spark
> more predictable as a SQL engine for people coming from more traditional
> databases.
> That said, I think there is no problem supporting the alter syntax for
> both Hive/MySQL and the more standard versions.
> On Tue, Oct 2, 2018 at 8:35 AM Felix Cheung <>
> wrote:
>> I think it has been an important “selling point” that Spark is “mostly
>> compatible“ with Hive DDL.
>> I have seen a lot of teams suffering from switching between Presto and
>> Hive dialects.
>> So one question I have is: are we at a point where we should switch from
>> being Hive-compatible to ANSI SQL?
>> Perhaps a more critical question: what would it take to get the platform
>> to support both, by making the ANTLR grammar extensible?
>> ------------------------------
>> *From:* Alessandro Solimando <>
>> *Sent:* Tuesday, October 2, 2018 12:35 AM
>> *To:*
>> *Cc:* Xiao Li; dev
>> *Subject:* Re: [DISCUSS] Syntax for table DDL
>> I agree with Ryan: a "standard" and more widely adopted syntax is usually
>> a good idea, with possibly some slight improvements like "bulk deletion"
>> of columns (especially because both the syntax and the semantics are
>> clear), rather than staying with Hive syntax at any cost.
>> I am personally following this PR with a lot of interest, thanks for all
>> the work along this direction.
>> Best regards,
>> Alessandro
>> On Mon, 1 Oct 2018 at 20:21, Ryan Blue <> wrote:
>>> What do you mean by consistent with the syntax in SqlBase.g4? These
>>> aren’t currently defined, so we need to decide what syntax to support.
>>> There are more details below, but the syntax I’m proposing is more standard
>>> across databases than Hive, which uses confusing and non-standard syntax.
>>> I doubt that we want to support Hive syntax for a few reasons. Hive uses
>>> the same column CHANGE statement for multiple purposes, so it ends up
>>> with strange patterns for simple tasks, like updating the column’s type:
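>>> For example, updating only a column's type looks something like this (a
>>> sketch of the Hive CHANGE form; the table and column names are made up):
>>>
>>>     ALTER TABLE t CHANGE a1 a1 INT;
>>>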
>>> The column name is doubled because old name, new name, and type are
>>> always required. So you have to know the type of a column to change its
>>> name and you have to double up the name to change its type. Hive also
>>> allows a couple other oddities:
>>>    - Column reordering with FIRST and AFTER keywords. Column reordering
>>>    is tricky to get right so I’m not sure we want to add it.
>>>    - RESTRICT and CASCADE to signal whether or not to change all
>>>    partitions. Spark doesn't support partition-level schemas except
>>>    through Hive, and even then I'm not sure how reliable it is.
>>> I know that we wouldn’t necessarily have to support these features from
>>> Hive, but I’m pointing them out to ask the question: why copy Hive’s syntax
>>> if it is unlikely that Spark will implement all of the “features”? I’d
>>> rather go with SQL syntax from databases like PostgreSQL or others that are
>>> more standard and common.
>>> The more “standard” versions of these statements are like what I’ve
>>> proposed:
>>>    - ALTER TABLE ident ALTER COLUMN qualifiedName TYPE dataType: ALTER
>>>    is used by SQL Server, Access, DB2, and PostgreSQL; MODIFY by MySQL
>>>    and Oracle. COLUMN is optional in Oracle and TYPE is omitted by
>>>    databases other than PostgreSQL. I think we could easily add MODIFY
>>>    as an alternative to the second ALTER (and maybe alternatives like
>>>    UPDATE and CHANGE) and make both TYPE and COLUMN optional.
>>>    - ALTER TABLE ident RENAME COLUMN qualifiedName TO qualifiedName:
>>>    This syntax is supported by PostgreSQL, Oracle, and DB2. MySQL uses the
>>>    same syntax as Hive, and it appears that SQL Server doesn't have this
>>>    statement. This also matches the table rename syntax, which uses TO.
>>>    - ALTER TABLE ident DROP (COLUMN | COLUMNS) qualifiedNameList: This
>>>    matches PostgreSQL, Oracle, DB2, and SQL Server. MySQL makes COLUMN
>>>    optional. Most don’t allow deleting multiple columns, but it’s a reasonable
>>>    extension.
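>>> To make that concrete, the three proposed statements would look like
>>> this (a sketch; the table and column names are made up):
>>>
>>>     ALTER TABLE t ALTER COLUMN point.x TYPE bigint;
>>>     ALTER TABLE t RENAME COLUMN point.x TO point.horizontal;
>>>     ALTER TABLE t DROP COLUMNS point.x, point.y;
>>>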
>>> While we’re on the subject of ALTER TABLE DDL, I should note that all
>>> of the databases use ADD COLUMN syntax that differs from Hive (and
>>> currently, Spark):
>>>    - ALTER TABLE ident ADD COLUMN qualifiedName dataType (','
>>>    qualifiedName dataType)*: All other databases I looked at use ADD
>>>    COLUMN, but not all of them support adding multiple columns at the
>>>    same time. Hive requires ( and ) enclosing the columns and uses the
>>>    COLUMNS keyword instead of COLUMN. I think that Spark should be
>>>    updated to make the parens optional and to support both keywords,
>>>    COLUMN and COLUMNS.
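>>> For example, the two forms side by side (a sketch; the table and column
>>> names are made up):
>>>
>>>     -- Hive: parens required, COLUMNS keyword
>>>     ALTER TABLE t ADD COLUMNS (c1 INT, c2 STRING);
>>>     -- proposed: parens optional, COLUMN or COLUMNS accepted
>>>     ALTER TABLE t ADD COLUMN c1 INT, c2 STRING;
>>>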
>>> What does everyone think? Is it reasonable to use the more standard
>>> syntax instead of using Hive as a base?
>>> rb
>>> On Fri, Sep 28, 2018 at 11:07 PM Xiao Li <> wrote:
>>>> Are they consistent with the current syntax defined in SqlBase.g4? I
>>>> think we are following the Hive DDL syntax.
>>>> On Fri, Sep 28, 2018 at 3:47 PM, Ryan Blue <> wrote:
>>>>> Hi everyone,
>>>>> I’m currently working on new table DDL statements for v2 tables. For
>>>>> context, the new logical plans for DataSourceV2 require a catalog interface
>>>>> so that Spark can create tables for operations like CTAS. The proposed
>>>>> TableCatalog API also includes an API for altering those tables, so we
>>>>> can make ALTER TABLE statements work. I'm implementing those DDL
>>>>> statements, which will make it into upstream Spark when the TableCatalog
>>>>> PR is merged.
>>>>> Since I'm adding new SQL statements that don't yet exist in Spark, I
>>>>> want to make sure that the syntax I'm using in our branch will match the
>>>>> syntax we add to Spark later. I'm basing this proposed syntax on
>>>>> PostgreSQL
>>>>> <>.
>>>>>    - *Update data type*: ALTER TABLE tableIdentifier ALTER COLUMN
>>>>>    qualifiedName TYPE dataType.
>>>>>    - *Rename column*: ALTER TABLE tableIdentifier RENAME COLUMN
>>>>>    qualifiedName TO qualifiedName
>>>>>    - *Drop column*: ALTER TABLE tableIdentifier DROP (COLUMN |
>>>>>    COLUMNS) qualifiedNameList
>>>>> A few notes:
>>>>>    - Using qualifiedName in these rules allows updating nested types,
>>>>>    like point.x.
>>>>>    - Updates and renames can only alter one column, but drop can drop
>>>>>    a list.
>>>>>    - Rename only renames; it can't move a field to a different nesting
>>>>>    level, and it will validate that, if the TO name is qualified, the
>>>>>    prefix matches the original field's.
>>>>>    - I’m also changing ADD COLUMN to support adding fields to nested
>>>>>    columns by using qualifiedName instead of identifier.
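>>>>> For example, adding a field to a nested struct column would look like
>>>>> this (a sketch; the table and column names are made up):
>>>>>
>>>>>     ALTER TABLE t ADD COLUMN point.z double;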
>>>>> Please reply to this thread if you have suggestions based on a
>>>>> different SQL engine or want this syntax to be different for another
>>>>> reason. Thanks!
>>>>> rb
>>>>> --
>>>>> Ryan Blue
>>>>> Software Engineer
>>>>> Netflix
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
> --
> Ryan Blue
> Software Engineer
> Netflix
