spark-dev mailing list archives

From Dongjoon Hyun <dongjoon.h...@gmail.com>
Subject Re: FYI: The evolution on `CHAR` type behavior
Date Fri, 20 Mar 2020 03:56:43 GMT
Technically, I have been struggling with (1) `CREATE TABLE` for a long time
(since 2017) because of its many differences. So, I made a wrong assumption
about the implication of "(2) FYI: SPARK-30098 Use default datasource as
provider for CREATE TABLE syntax", Reynold. I admit that. You may not feel
the same way. However, it was a lot to me. Also, switching
`convertMetastoreOrc` in 2.4 was a big change for me, although it makes
no difference for Parquet-only users.

Dongjoon.

> References:
> 1. "CHAR implementation?", 2017/09/15
> https://lists.apache.org/thread.html/96b004331d9762e356053b5c8c97e953e398e489d15e1b49e775702f%40%3Cdev.spark.apache.org%3E
> 2. "FYI: SPARK-30098 Use default datasource as provider for CREATE TABLE
> syntax", 2019/12/06
> https://lists.apache.org/thread.html/493f88c10169680191791f9f6962fd16cd0ffa3b06726e92ed04cbe1%40%3Cdev.spark.apache.org%3E



On Thu, Mar 19, 2020 at 8:47 PM Reynold Xin <rxin@databricks.com> wrote:

> You are joking when you said "informed widely and discussed in many ways
> twice", right?
>
> This thread doesn't even talk about char/varchar:
> https://lists.apache.org/thread.html/493f88c10169680191791f9f6962fd16cd0ffa3b06726e92ed04cbe1%40%3Cdev.spark.apache.org%3E
>
> (Yes it talked about changing the default data source provider, but that's
> just one of the ways we are exposing this char/varchar issue).
>
>
>
> On Thu, Mar 19, 2020 at 8:41 PM, Dongjoon Hyun <dongjoon.hyun@gmail.com>
> wrote:
>
>> +1 for Wenchen's suggestion.
>>
>> I believe that the difference and effects are informed widely and
>> discussed in many ways twice.
>>
>> First, this was shared last December.
>>
>>     "FYI: SPARK-30098 Use default datasource as provider for CREATE TABLE
>> syntax", 2019/12/06
>>
>> https://lists.apache.org/thread.html/493f88c10169680191791f9f6962fd16cd0ffa3b06726e92ed04cbe1%40%3Cdev.spark.apache.org%3E
>>
>> Second (at this time in this thread), this has been discussed according
>> to the new community rubric.
>>
>>     - https://spark.apache.org/versioning-policy.html (Section:
>> "Considerations When Breaking APIs")
>>
>> Thank you all.
>>
>> Bests,
>> Dongjoon.
>>
>> On Tue, Mar 17, 2020 at 10:41 PM Wenchen Fan <cloud0fan@gmail.com> wrote:
>>
>>> OK let me put a proposal here:
>>>
>>> 1. Permanently ban CHAR for native data source tables, and only keep it
>>> for Hive compatibility.
>>> It's OK to forget about padding like what Snowflake and MySQL have done.
>>> But it's hard for Spark to require consistent behavior about CHAR type in
>>> all data sources. Since CHAR type is not that useful nowadays, seems OK to
>>> just ban it. Another way is to document that the padding of CHAR type is
>>> data source dependent, but it's a bit weird to leave this inconsistency in
>>> Spark.
>>>
>>> 2. Leave VARCHAR unchanged in 3.0
>>> VARCHAR type is so widely used in databases and it's weird if Spark
>>> doesn't support it. VARCHAR type is exactly the same as Spark StringType
>>> when the length limitation is not hit, and I'm fine to temporarily leave
>>> this flaw in 3.0 and users may hit behavior changes when the string values
>>> hit the VARCHAR length limitation.
>>>
>>> 3. Finalize the VARCHAR behavior in 3.1
>>> For now I have 2 ideas:
>>> a) Make VARCHAR(x) a first-class data type. This means Spark data
>>> sources should support VARCHAR, and CREATE TABLE should fail if a column is
>>> VARCHAR type and the underlying data source doesn't support it (e.g.
>>> JSON/CSV). Type cast, type coercion, table insertion, etc. should be
>>> updated as well.
>>> b) Simply document that, the underlying data source may or may not
>>> enforce the length limitation of VARCHAR(x).
>>>
>>> Please let me know if you have different ideas.
>>>
>>> Thanks,
>>> Wenchen
>>>
>>> On Wed, Mar 18, 2020 at 1:08 AM Michael Armbrust <michael@databricks.com>
>>> wrote:
>>>
>>>>> What I'd oppose is to just ban char for the native data sources, and do
>>>>> not have a plan to address this problem systematically.
>>>>>
>>>>
>>>> +1
>>>>
>>>>
>>>>> Just forget about padding, like what Snowflake and MySQL have done.
>>>>> Document that char(x) is just an alias for string. And then move on.
>>>>> Almost no work needs to be done...
>>>>>
>>>>
>>>> +1
>>>>
>>>
>
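
To make the CHAR/VARCHAR semantics debated in this thread concrete, here is a
minimal Python sketch. It is a toy model and not Spark code: the helper names
`char_store` and `varchar_store` and the `pad` / `enforce` flags are
hypothetical, and they only mimic how a padding versus non-padding (or
enforcing versus non-enforcing) data source might treat the same value.

```python
def char_store(value: str, n: int, pad: bool) -> str:
    """Toy CHAR(n) write path (hypothetical helper, not a Spark API).

    A padding source right-pads the value with spaces up to n characters;
    a non-padding source treats CHAR(n) as an alias for string.
    """
    if not pad:
        return value
    if len(value) > n:
        raise ValueError(f"value exceeds CHAR({n})")
    return value.ljust(n)


def varchar_store(value: str, n: int, enforce: bool) -> str:
    """Toy VARCHAR(n) write path (hypothetical helper, not a Spark API).

    An enforcing source rejects over-long values; a non-enforcing source
    stores them unchanged, like a plain string.
    """
    if enforce and len(value) > n:
        raise ValueError(f"value exceeds VARCHAR({n})")
    return value


# The same CHAR(5) value written through a padding and a non-padding source
# no longer compares equal, which is the inconsistency discussed above.
padded = char_store("ab", 5, pad=True)
unpadded = char_store("ab", 5, pad=False)
print(repr(padded), repr(unpadded), padded == unpadded)

# Within the length limit, VARCHAR behaves like a plain string either way.
same = varchar_store("abc", 5, enforce=True) == varchar_store("abc", 5, enforce=False)
print(same)
```

In this model the two sources only diverge for CHAR values shorter than the
declared length and for VARCHAR values longer than it, which corresponds to
the cases the proposal singles out.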
