spark-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Ryan Blue <rb...@netflix.com.INVALID>
Subject Re: [DISCUSS] Syntax for table DDL
Date Thu, 04 Oct 2018 15:21:43 GMT
Sounds good. I'll plan on adding a PR with Hive's CHANGE syntax in addition
to what I've proposed here.

I have all of these working in our Spark distribution, so I'm just waiting
on finalizing the TableCatalog API to submit these upstream.

On Wed, Oct 3, 2018 at 10:07 PM Wenchen Fan <cloud0fan@gmail.com> wrote:

> Thank you Ryan for proposing the DDL syntax! I think it's good to follow
> mainstream databases, and the proposed syntax looks very reasonable.
>
> About Hive compatibility, I think it's not that important now, but it's
> still good if we keep it. Shall we support the Hive syntax as an
> alternative? It seems not very hard, just a few more ANTLR rules. It will
> be better if we can make ANTLR extensible and allow other data sources to
> define custom SQL syntax.
>
> Anyway I think they are orthogonal. We can go ahead with the proposed
> syntax here, and add Hive compatible syntax later.
>
> On Tue, Oct 2, 2018 at 11:50 PM Ryan Blue <rblue@netflix.com.invalid>
> wrote:
>
>> I'd say that it was important to be compatible with Hive in the past, but
>> that's becoming less important over time. Spark is well established with
>> Hadoop users and I think the focus moving forward should be to make Spark
>> more predictable as a SQL engine for people coming from more traditional
>> databases..
>>
>> That said, I think there is no problem supporting the alter syntax for
>> both Hive/MySQL and the more standard versions.
>>
>> On Tue, Oct 2, 2018 at 8:35 AM Felix Cheung <felixcheung_m@hotmail.com>
>> wrote:
>>
>>> I think it has been an important “selling point” that Spark is “mostly
>>> compatible“ with Hive DDL.
>>>
>>> I have see a lot of teams suffering from switching between Presto and
>>> Hive dialects.
>>>
>>> So one question I have is, we are at a point of switch from Hive
>>> compatible to ANSI SQL, say?
>>>
>>> Perhaps a more critical question, what does it take to get the platform
>>> to support both, by making the ANTLR extensible?
>>>
>>>
>>>
>>> ------------------------------
>>> *From:* Alessandro Solimando <alessandro.solimando@gmail.com>
>>> *Sent:* Tuesday, October 2, 2018 12:35 AM
>>> *To:* rblue@netflix.com
>>> *Cc:* Xiao Li; dev
>>> *Subject:* Re: [DISCUSS] Syntax for table DDL
>>>
>>> I agree with Ryan, a "standard" and more widely adopted syntax is
>>> usually a good idea, with possibly some slight improvements like "bulk
>>> deletion" of columns (especially because both the syntax and the semantics
>>> are clear), rather than stay with Hive syntax at any cost.
>>>
>>> I am personally following this PR with a lot of interest, thanks for all
>>> the work along this direction.
>>>
>>> Best regards,
>>> Alessandro
>>>
>>> On Mon, 1 Oct 2018 at 20:21, Ryan Blue <rblue@netflix.com.invalid>
>>> wrote:
>>>
>>>> What do you mean by consistent with the syntax in SqlBase.g4? These
>>>> aren’t currently defined, so we need to decide what syntax to support.
>>>> There are more details below, but the syntax I’m proposing is more standard
>>>> across databases than Hive, which uses confusing and non-standard syntax.
>>>>
>>>> I doubt that we want to support Hive syntax for a few reasons. Hive
>>>> uses the same column CHANGE statement for multiple purposes, so it
>>>> ends up with strange patterns for simple tasks, like updating the column’s
>>>> type:
>>>>
>>>> ALTER TABLE t CHANGE a1 a1 INT;
>>>>
>>>> The column name is doubled because old name, new name, and type are
>>>> always required. So you have to know the type of a column to change its
>>>> name and you have to double up the name to change its type. Hive also
>>>> allows a couple other oddities:
>>>>
>>>>    - Column reordering with FIRST and AFTER keywords. Column
>>>>    reordering is tricky to get right so I’m not sure we want to add it.
>>>>    - RESTRICT and CASCADE to signal whether to change all partitions
>>>>    or not. Spark doesn’t support partition-level schemas except through
Hive,
>>>>    and even then I’m not sure how reliable it is.
>>>>
>>>> I know that we wouldn’t necessarily have to support these features from
>>>> Hive, but I’m pointing them out to ask the question: why copy Hive’s
syntax
>>>> if it is unlikely that Spark will implement all of the “features”? I’d
>>>> rather go with SQL syntax from databases like PostgreSQL or others that are
>>>> more standard and common.
>>>>
>>>> The more “standard” versions of these statements are like what I’ve
>>>> proposed:
>>>>
>>>>    - ALTER TABLE ident ALTER COLUMN qualifiedName TYPE dataType: ALTER
>>>>    is used by SQL Server, Access, DB2, and PostgreSQL; MODIFY by MySQL
>>>>    and Oracle. COLUMN is optional in Oracle and TYPE is omitted by
>>>>    databases other than PosgreSQL. I think we could easily add MODIFY
>>>>    as an alternative to the second ALTER (and maybe alternatives like
>>>>    UPDATE and CHANGE) and make both TYPE and COLUMN optional.
>>>>    - ALTER TABLE ident RENAME COLUMN qualifiedName TO qualifiedName:
>>>>    This syntax is supported by PostgreSQL, Oracle, and DB2. MySQL uses the
>>>>    same syntax as Hive and it appears that SQL server doesn’t have this
>>>>    statement. This also match the table rename syntax, which uses TO.
>>>>    - ALTER TABLE ident DROP (COLUMN | COLUMNS) qualifiedNameList: This
>>>>    matches PostgreSQL, Oracle, DB2, and SQL server. MySQL makes COLUMN
>>>>    optional. Most don’t allow deleting multiple columns, but it’s a reasonable
>>>>    extension.
>>>>
>>>> While we’re on the subject of ALTER TABLE DDL, I should note that all
>>>> of the databases use ADD COLUMN syntax that differs from Hive (and
>>>> currently, Spark):
>>>>
>>>>    - ALTER TABLE ident ADD COLUMN qualifiedName dataType (','
>>>>    qualifiedName dataType)*: All other databases I looked at use ADD
>>>>    COLUMN, but not all of them support adding multiple columns at the
>>>>    same time. Hive requires ( and ) enclosing the columns and uses the
>>>>    COLUMNS keyword instead of COLUMN. I think that Spark should be
>>>>    updated to make the parens optional and to support both keywords,
>>>>    COLUMN and COLUMNS.
>>>>
>>>> What does everyone think? Is it reasonable to use the more standard
>>>> syntax instead of using Hive as a base?
>>>>
>>>> rb
>>>>
>>>> On Fri, Sep 28, 2018 at 11:07 PM Xiao Li <gatorsmile@gmail.com> wrote:
>>>>
>>>>> Are they consistent with the current syntax defined in SqlBase.g4? I
>>>>> think we are following the Hive DDL syntax:
>>>>> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-AlterTable/Partition/Column
>>>>>
>>>>> Ryan Blue <rblue@netflix.com.invalid> 于2018年9月28日周五
下午3:47写道:
>>>>>
>>>>>> Hi everyone,
>>>>>>
>>>>>> I’m currently working on new table DDL statements for v2 tables.
For
>>>>>> context, the new logical plans for DataSourceV2 require a catalog
interface
>>>>>> so that Spark can create tables for operations like CTAS. The proposed
>>>>>> TableCatalog API also includes an API for altering those tables so
we can
>>>>>> make ALTER TABLE statements work. I’m implementing those DDL statements,
>>>>>> which will make it into upstream Spark when the TableCatalog PR is
merged.
>>>>>>
>>>>>> Since I’m adding new SQL statements that don’t yet exist in Spark,
I
>>>>>> want to make sure that the syntax I’m using in our branch will
match the
>>>>>> syntax we add to Spark later. I’m basing this proposed syntax on
>>>>>> PostgreSQL
>>>>>> <https://www.postgresql.org/docs/current/static/ddl-alter.html>.
>>>>>>
>>>>>>    - *Update data type*: ALTER TABLE tableIdentifier ALTER COLUMN
>>>>>>    qualifiedName TYPE dataType.
>>>>>>    - *Rename column*: ALTER TABLE tableIdentifier RENAME COLUMN
>>>>>>    qualifiedName TO qualifiedName
>>>>>>    - *Drop column*: ALTER TABLE tableIdentifier DROP (COLUMN |
>>>>>>    COLUMNS) qualifiedNameList
>>>>>>
>>>>>> A few notes:
>>>>>>
>>>>>>    - Using qualifiedName in these rules allows updating nested
>>>>>>    types, like point.x.
>>>>>>    - Updates and renames can only alter one column, but drop can
>>>>>>    drop a list.
>>>>>>    - Rename can’t move types and will validate that if the TO name
>>>>>>    is qualified, that the prefix matches the original field.
>>>>>>    - I’m also changing ADD COLUMN to support adding fields to nested
>>>>>>    columns by using qualifiedName instead of identifier.
>>>>>>
>>>>>> Please reply to this thread if you have suggestions based on a
>>>>>> different SQL engine or want this syntax to be different for another
>>>>>> reason. Thanks!
>>>>>>
>>>>>> rb
>>>>>> --
>>>>>> Ryan Blue
>>>>>> Software Engineer
>>>>>> Netflix
>>>>>>
>>>>>
>>>>
>>>> --
>>>> Ryan Blue
>>>> Software Engineer
>>>> Netflix
>>>>
>>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>

-- 
Ryan Blue
Software Engineer
Netflix

Mime
View raw message