spark-user mailing list archives

From Dongjoon Hyun <dongjoon.h...@gmail.com>
Subject Re: FYI: The evolution on `CHAR` type behavior
Date Mon, 16 Mar 2020 02:02:43 GMT
Hi, Reynold.

Please see the following for the context.

https://issues.apache.org/jira/browse/SPARK-31136
"Revert SPARK-30098 Use default datasource as provider for CREATE TABLE
syntax"

I raised the above issue under the new rubric, and the ban was proposed as
an alternative to reduce the potential problems.

Please give us your opinion, since it is still an open PR.

Bests,
Dongjoon.

On Sat, Mar 14, 2020 at 17:54 Reynold Xin <rxin@databricks.com> wrote:

> I don’t understand this change. Wouldn’t this “ban” confuse the hell out
> of both new and old users?
>
> For old users, their old code that was working for char(3) would now stop
> working.
>
> For new users, depending on the underlying metastore, char(3) is either
> supported but different from ANSI SQL (which is not that big of a deal if
> we explain it) or not supported.
>
> On Sat, Mar 14, 2020 at 3:51 PM Dongjoon Hyun <dongjoon.hyun@gmail.com>
> wrote:
>
>> Hi, All.
>>
>> Apache Spark has suffered from a known consistency issue in `CHAR` type
>> behavior across its usages and configurations. However, the behavior has
>> been gradually evolving toward consistency inside Apache Spark, because we
>> don't officially have a `CHAR` type. The following is a summary.
>>
>> In 1.6.x ~ 2.3.x, `STORED AS PARQUET` gives a different result, as follows.
>> (`spark.sql.hive.convertMetastoreParquet=false` provides a fallback to
>> Hive behavior.)
>>
>>     spark-sql> CREATE TABLE t1(a CHAR(3));
>>     spark-sql> CREATE TABLE t2(a CHAR(3)) STORED AS ORC;
>>     spark-sql> CREATE TABLE t3(a CHAR(3)) STORED AS PARQUET;
>>
>>     spark-sql> INSERT INTO TABLE t1 SELECT 'a ';
>>     spark-sql> INSERT INTO TABLE t2 SELECT 'a ';
>>     spark-sql> INSERT INTO TABLE t3 SELECT 'a ';
>>
>>     spark-sql> SELECT a, length(a) FROM t1;
>>     a   3
>>     spark-sql> SELECT a, length(a) FROM t2;
>>     a   3
>>     spark-sql> SELECT a, length(a) FROM t3;
>>     a 2
>>
>> Since 2.4.0, `STORED AS ORC` has been consistent.
>> (`spark.sql.hive.convertMetastoreOrc=false` provides a fallback to Hive
>> behavior.)
>>
>>     spark-sql> SELECT a, length(a) FROM t1;
>>     a   3
>>     spark-sql> SELECT a, length(a) FROM t2;
>>     a 2
>>     spark-sql> SELECT a, length(a) FROM t3;
>>     a 2
>>
>> Since 3.0.0-preview2, `CREATE TABLE` (without a `STORED AS` clause) has
>> been consistent.
>> (`spark.sql.legacy.createHiveTableByDefault.enabled=true` provides a
>> fallback to Hive behavior.)
>>
>>     spark-sql> SELECT a, length(a) FROM t1;
>>     a 2
>>     spark-sql> SELECT a, length(a) FROM t2;
>>     a 2
>>     spark-sql> SELECT a, length(a) FROM t3;
>>     a 2
>>
>> In addition, in 3.0.0, SPARK-31147 aims to ban the `CHAR`/`VARCHAR` types
>> in the following syntax, to be safe.
>>
>>     CREATE TABLE t(a CHAR(3));
>>     https://github.com/apache/spark/pull/27902
>>
>> This email is sent to inform you, following the new policy we voted on.
>> The recommendation is to always use Apache Spark's native type `String`.
>>
>> Bests,
>> Dongjoon.
>>
>> References:
>> 1. "CHAR implementation?", 2017/09/15
>>
>> https://lists.apache.org/thread.html/96b004331d9762e356053b5c8c97e953e398e489d15e1b49e775702f%40%3Cdev.spark.apache.org%3E
>> 2. "FYI: SPARK-30098 Use default datasource as provider for CREATE TABLE
>> syntax", 2019/12/06
>>
>> https://lists.apache.org/thread.html/493f88c10169680191791f9f6962fd16cd0ffa3b06726e92ed04cbe1%40%3Cdev.spark.apache.org%3E
>>
>
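
For readers following the quoted transcripts, here is a minimal sketch of the two behaviors they show. It is a plain-Python model, not Spark or Hive code. The helper names `hive_char` and `spark_string` are hypothetical, and the model assumes only what the transcripts show: the Hive-style `CHAR(3)` path reports length 3 for the inserted value 'a ', while the native string path reports length 2.

```python
def hive_char(value: str, n: int) -> str:
    """Hypothetical model of the Hive-style CHAR(n) result: right-pad with spaces to n.

    Only padding is modeled here; values longer than n are returned unchanged.
    """
    return value.ljust(n)


def spark_string(value: str) -> str:
    """Hypothetical model of the native string result: the value is kept as inserted."""
    return value


inserted = "a "  # the value used in the INSERT statements of the quoted message

padded = hive_char(inserted, 3)
native = spark_string(inserted)

print(repr(padded), len(padded))  # 'a  ' 3
print(repr(native), len(native))  # 'a ' 2
```

The same inserted value yields different lengths depending on which path reads the table, which is the inconsistency the summary above walks through version by version.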
