spark-user mailing list archives

From Gourav Sengupta <gourav.sengu...@gmail.com>
Subject Re: FYI: The evolution on `CHAR` type behavior
Date Mon, 16 Mar 2020 06:15:04 GMT
Hi,

100% agree with Reynold.


Regards,
Gourav Sengupta

On Mon, Mar 16, 2020 at 3:31 AM Reynold Xin <rxin@databricks.com> wrote:

> Are we sure "not padding" is "incorrect"?
>
> I don't know whether ANSI SQL actually requires padding, but plenty of
> databases don't actually pad.
>
> Snowflake:
> https://docs.snowflake.net/manuals/sql-reference/data-types-text.html
> "Snowflake currently deviates from common CHAR semantics in that strings
> shorter than the maximum length are not space-padded at the end."
>
> MySQL:
> https://stackoverflow.com/questions/53528645/why-char-dont-have-padding-in-mysql
>
> On Sun, Mar 15, 2020 at 7:02 PM, Dongjoon Hyun <dongjoon.hyun@gmail.com>
> wrote:
>
>> Hi, Reynold.
>>
>> Please see the following for the context.
>>
>> https://issues.apache.org/jira/browse/SPARK-31136
>> "Revert SPARK-30098 Use default datasource as provider for CREATE TABLE
>> syntax"
>>
>> I raised the above issue according to the new rubric, and the banning was
>> the proposed alternative to reduce the potential issue.
>>
>> Please give us your opinion since it's still a PR.
>>
>> Bests,
>> Dongjoon.
>>
>> On Sat, Mar 14, 2020 at 17:54 Reynold Xin <rxin@databricks.com> wrote:
>>
>>> I don’t understand this change. Wouldn’t this “ban” confuse the hell out
>>> of both new and old users?
>>>
>>> For old users, their old code that was working for char(3) would now
>>> stop working.
>>>
>>> For new users, depending on the underlying metastore, char(3) is either
>>> supported but different from ANSI SQL (which is not that big of a deal if
>>> we explain it) or not supported.
>>>
>>> On Sat, Mar 14, 2020 at 3:51 PM Dongjoon Hyun <dongjoon.hyun@gmail.com>
>>> wrote:
>>>
>>>> Hi, All.
>>>>
>>>> Apache Spark has suffered from a known consistency issue in `CHAR` type
>>>> behavior across its usages and configurations. However, the direction has
>>>> been gradually moving toward consistency inside Apache Spark, because we
>>>> don't officially have `CHAR`. The following is a summary.
>>>>
>>>> In 1.6.x ~ 2.3.x, `STORED AS PARQUET` gives a different result, as follows.
>>>> (`spark.sql.hive.convertMetastoreParquet=false` provides a fallback to
>>>> Hive behavior.)
>>>>
>>>>     spark-sql> CREATE TABLE t1(a CHAR(3));
>>>>     spark-sql> CREATE TABLE t2(a CHAR(3)) STORED AS ORC;
>>>>     spark-sql> CREATE TABLE t3(a CHAR(3)) STORED AS PARQUET;
>>>>
>>>>     spark-sql> INSERT INTO TABLE t1 SELECT 'a ';
>>>>     spark-sql> INSERT INTO TABLE t2 SELECT 'a ';
>>>>     spark-sql> INSERT INTO TABLE t3 SELECT 'a ';
>>>>
>>>>     spark-sql> SELECT a, length(a) FROM t1;
>>>>     a   3
>>>>     spark-sql> SELECT a, length(a) FROM t2;
>>>>     a   3
>>>>     spark-sql> SELECT a, length(a) FROM t3;
>>>>     a 2
>>>>
>>>> Since 2.4.0, `STORED AS ORC` became consistent.
>>>> (`spark.sql.hive.convertMetastoreOrc=false` provides a fallback to Hive
>>>> behavior.)
>>>>
>>>>     spark-sql> SELECT a, length(a) FROM t1;
>>>>     a   3
>>>>     spark-sql> SELECT a, length(a) FROM t2;
>>>>     a 2
>>>>     spark-sql> SELECT a, length(a) FROM t3;
>>>>     a 2
>>>>
>>>> Since 3.0.0-preview2, `CREATE TABLE` (without `STORED AS` clause)
>>>> became consistent.
>>>> (`spark.sql.legacy.createHiveTableByDefault.enabled=true` provides a
>>>> fallback to Hive behavior.)
>>>>
>>>>     spark-sql> SELECT a, length(a) FROM t1;
>>>>     a 2
>>>>     spark-sql> SELECT a, length(a) FROM t2;
>>>>     a 2
>>>>     spark-sql> SELECT a, length(a) FROM t3;
>>>>     a 2
>>>>
>>>> In addition, in 3.0.0, SPARK-31147 aims to ban `CHAR/VARCHAR` type in
>>>> the following syntax to be safe.
>>>>
>>>>     CREATE TABLE t(a CHAR(3));
>>>>     https://github.com/apache/spark/pull/27902
>>>>
>>>> This email is sent to inform you, in accordance with the new policy we
>>>> voted on. The recommendation is to always use Apache Spark's native type
>>>> `String`.
>>>>
>>>> Bests,
>>>> Dongjoon.
>>>>
>>>> References:
>>>> 1. "CHAR implementation?", 2017/09/15
>>>>
>>>> https://lists.apache.org/thread.html/96b004331d9762e356053b5c8c97e953e398e489d15e1b49e775702f%40%3Cdev.spark.apache.org%3E
>>>> 2. "FYI: SPARK-30098 Use default datasource as provider for CREATE
>>>> TABLE syntax", 2019/12/06
>>>>
>>>> https://lists.apache.org/thread.html/493f88c10169680191791f9f6962fd16cd0ffa3b06726e92ed04cbe1%40%3Cdev.spark.apache.org%3E
>>>>
>>>
>

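The `length(a)` results quoted above (3 in some cases, 2 in others, for the same inserted value 'a ') come down to whether the stored value is space-padded to the declared CHAR(3) width. The following is a minimal Python sketch of that difference only. It is not Spark or Hive code. The function names `store_as_char` and `store_as_string` are hypothetical, and rejecting over-long input is a choice made for this sketch, not something stated in the thread.

```python
def store_as_char(value: str, n: int) -> str:
    """Padded CHAR(n) semantics: right-pad with spaces to exactly n characters."""
    if len(value) > n:
        # Assumption for this sketch only: reject over-long input.
        raise ValueError(f"value longer than CHAR({n})")
    return value.ljust(n)


def store_as_string(value: str) -> str:
    """Plain string semantics: the value is kept exactly as written."""
    return value


inserted = "a "  # the value used in the INSERT statements quoted above

# Padded behavior: the stored value is widened to three characters.
print(repr(store_as_char(inserted, 3)), len(store_as_char(inserted, 3)))

# Non-padded behavior: the stored value keeps its original two characters.
print(repr(store_as_string(inserted)), len(store_as_string(inserted)))
```

Under these assumptions the padded path corresponds to the rows reporting length 3 and the non-padded path to the rows reporting length 2.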