spark-dev mailing list archives

From Dongjoon Hyun <dongjoon.h...@gmail.com>
Subject Re: FYI: The evolution on `CHAR` type behavior
Date Fri, 20 Mar 2020 03:41:43 GMT
+1 for Wenchen's suggestion.

I believe the difference and its effects have been widely communicated and
discussed in several ways, on two occasions.

First, this was shared last December.

    "FYI: SPARK-30098 Use default datasource as provider for CREATE TABLE
syntax", 2019/12/06

https://lists.apache.org/thread.html/493f88c10169680191791f9f6962fd16cd0ffa3b06726e92ed04cbe1%40%3Cdev.spark.apache.org%3E

Second (this time, in this thread), this has been discussed according to
the new community rubric.

    - https://spark.apache.org/versioning-policy.html (Section:
"Considerations When Breaking APIs")

Thank you all.

Bests,
Dongjoon.

On Tue, Mar 17, 2020 at 10:41 PM Wenchen Fan <cloud0fan@gmail.com> wrote:

> OK let me put a proposal here:
>
> 1. Permanently ban CHAR for native data source tables, and only keep it
> for Hive compatibility.
> It's OK to forget about padding like what Snowflake and MySQL have done.
> But it's hard for Spark to require consistent behavior about CHAR type in
> all data sources. Since CHAR type is not that useful nowadays, seems OK to
> just ban it. Another way is to document that the padding of CHAR type is
> data source dependent, but it's a bit weird to leave this inconsistency in
> Spark.
>
> 2. Leave VARCHAR unchanged in 3.0
> VARCHAR type is so widely used in databases and it's weird if Spark
> doesn't support it. VARCHAR type is exactly the same as Spark StringType
> when the length limitation is not hit, and I'm fine to temporarily leave
> this flaw in 3.0 and users may hit behavior changes when the string values
> hit the VARCHAR length limitation.
>
> 3. Finalize the VARCHAR behavior in 3.1
> For now I have 2 ideas:
> a) Make VARCHAR(x) a first-class data type. This means Spark data sources
> should support VARCHAR, and CREATE TABLE should fail if a column is VARCHAR
> type and the underlying data source doesn't support it (e.g. JSON/CSV).
> Type cast, type coercion, table insertion, etc. should be updated as well.
> b) Simply document that the underlying data source may or may not enforce
> the length limitation of VARCHAR(x).
>
> Please let me know if you have different ideas.
>
> Thanks,
> Wenchen
>
> On Wed, Mar 18, 2020 at 1:08 AM Michael Armbrust <michael@databricks.com>
> wrote:
>
>>> What I'd oppose is to just ban char for the native data sources, and do
>>> not have a plan to address this problem systematically.
>>>
>>
>> +1
>>
>>
>>> Just forget about padding, like what Snowflake and MySQL have done.
>>> Document that char(x) is just an alias for string. And then move on. Almost
>>> no work needs to be done...
>>>
>>
>> +1
>>
>>
>
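
For readers skimming the thread, the behaviors being debated can be sketched in a few lines. This is only a plain-Python model of the semantics discussed above, not Spark code: the function names `store_char` and `store_varchar` and the `pad` / `enforce` flags are hypothetical and exist only for illustration. It assumes that a padding CHAR(n) right-pads with spaces and rejects longer values, and that a data source either enforces or ignores the VARCHAR(n) limit.

```python
def store_char(value: str, n: int, pad: bool) -> str:
    """Model of writing `value` into a CHAR(n) column (hypothetical helper).

    pad=True  : fixed-width behavior, right-pad with spaces to n characters.
    pad=False : the "forget about padding" option, CHAR(n) acts like a string.
    """
    if pad:
        if len(value) > n:
            # Assumption: a padding implementation rejects over-long values.
            raise ValueError(f"value is longer than CHAR({n})")
        return value.ljust(n)
    return value


def store_varchar(value: str, n: int, enforce: bool) -> str:
    """Model of writing `value` into a VARCHAR(n) column (hypothetical helper).

    enforce=True  : the data source checks the length limit (idea 3a).
    enforce=False : the data source ignores the limit (allowed under idea 3b).
    """
    if enforce and len(value) > n:
        raise ValueError(f"value is longer than VARCHAR({n})")
    return value


if __name__ == "__main__":
    # Same input, two CHAR behaviors: this is the inconsistency across sources.
    print(repr(store_char("ab", 5, pad=True)))
    print(repr(store_char("ab", 5, pad=False)))
    # Below the limit, VARCHAR behaves like a plain string either way.
    print(repr(store_varchar("ab", 5, enforce=True)))
```

The two `store_char` calls return different strings for the same input, which is the cross-source inconsistency the proposal wants to avoid. The VARCHAR variants only diverge once a value exceeds the declared length.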
