spark-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Dongjoon Hyun <dongjoon.h...@gmail.com>
Subject Re: FYI: The evolution on `CHAR` type behavior
Date Tue, 17 Mar 2020 00:54:53 GMT
Thank you for sharing and confirming.

We had better consider all heterogeneous customers in the world. And, I
also have experiences with the non-negligible cases in on-prem.

Bests,
Dongjoon.

On Mon, Mar 16, 2020 at 5:42 PM Reynold Xin <rxin@databricks.com> wrote:

> −User
>
> char barely showed up (honestly negligible). I was comparing select vs
> select.
>
>
>
> On Mon, Mar 16, 2020 at 5:40 PM, Dongjoon Hyun <dongjoon.hyun@gmail.com>
> wrote:
>
>> Ur, are you comparing the number of SELECT statement with TRIM and CREATE
>> statements with `CHAR`?
>>
>> > I looked up our usage logs (sorry I can't share this publicly) and trim
>> has at least four orders of magnitude higher usage than char.
>>
>> We need to discuss more about what to do. This thread is what I expected
>> exactly. :)
>>
>> > BTW I'm not opposing us sticking to SQL standard (I'm in general for
>> it). I was merely pointing out that if we deviate away from SQL standard in
>> any way we are considered "wrong" or "incorrect". That argument itself is
>> flawed when plenty of other popular database systems also deviate away from
>> the standard on this specific behavior.
>>
>> Bests,
>> Dongjoon.
>>
>> On Mon, Mar 16, 2020 at 5:35 PM Reynold Xin <rxin@databricks.com> wrote:
>>
>>> BTW I'm not opposing us sticking to SQL standard (I'm in general for
>>> it). I was merely pointing out that if we deviate away from SQL standard in
>>> any way we are considered "wrong" or "incorrect". That argument itself is
>>> flawed when plenty of other popular database systems also deviate away from
>>> the standard on this specific behavior.
>>>
>>>
>>>
>>>
>>> On Mon, Mar 16, 2020 at 5:29 PM, Reynold Xin <rxin@databricks.com>
>>> wrote:
>>>
>>>> I looked up our usage logs (sorry I can't share this publicly) and trim
>>>> has at least four orders of magnitude higher usage than char.
>>>>
>>>>
>>>> On Mon, Mar 16, 2020 at 5:27 PM, Dongjoon Hyun <dongjoon.hyun@gmail.com
>>>> > wrote:
>>>>
>>>>> Thank you, Stephen and Reynold.
>>>>>
>>>>> To Reynold.
>>>>>
>>>>> The way I see the following is a little different.
>>>>>
>>>>>       > CHAR is an undocumented data type without clearly defined
>>>>> semantics.
>>>>>
>>>>> Let me describe in Apache Spark User's View point.
>>>>>
>>>>> Apache Spark started to claim `HiveContext` (and `hql/hiveql`
>>>>> function) at Apache Spark 1.x without much documentation. In addition,
>>>>> there still exists an effort which is trying to keep it in 3.0.0 age.
>>>>>
>>>>>        https://issues.apache.org/jira/browse/SPARK-31088
>>>>>        Add back HiveContext and createExternalTable
>>>>>
>>>>> Historically, we tried to make many SQL-based customer migrate their
>>>>> workloads from Apache Hive into Apache Spark through `HiveContext`.
>>>>>
>>>>> Although Apache Spark didn't have a good document about the
>>>>> inconsistent behavior among its data sources, Apache Hive has been
>>>>> providing its documentation and many customers rely the behavior.
>>>>>
>>>>>       -
>>>>> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types
>>>>>
>>>>> At that time, frequently in on-prem Hadoop clusters by well-known
>>>>> vendors, many existing huge tables were created by Apache Hive, not Apache
>>>>> Spark. And, Apache Spark is used for boosting SQL performance with its
>>>>> *caching*. This was true because Apache Spark was added into the
>>>>> Hadoop-vendor products later than Apache Hive.
>>>>>
>>>>> Until the turning point at Apache Spark 2.0, we tried to catch up more
>>>>> features to be consistent at least with Hive tables in Apache Hive and
>>>>> Apache Spark because two SQL engines share the same tables.
>>>>>
>>>>> For the following, technically, while Apache Hive doesn't changed its
>>>>> existing behavior in this part, Apache Spark evolves inevitably by moving
>>>>> away from the original Apache Spark old behaviors one-by-one.
>>>>>
>>>>>       >  the value is already fucked up
>>>>>
>>>>> The following is the change log.
>>>>>
>>>>>       - When we switched the default value of
>>>>> `convertMetastoreParquet`. (at Apache Spark 1.2)
>>>>>       - When we switched the default value of `convertMetastoreOrc`
>>>>> (at Apache Spark 2.4)
>>>>>       - When we switched `CREATE TABLE` itself. (Change `TEXT` table
>>>>> to `PARQUET` table at Apache Spark 3.0)
>>>>>
>>>>> To sum up, this has been a well-known issue in the community and among
>>>>> the customers.
>>>>>
>>>>> Bests,
>>>>> Dongjoon.
>>>>>
>>>>> On Mon, Mar 16, 2020 at 5:24 PM Stephen Coy <scoy@infomedia.com.au>
>>>>> wrote:
>>>>>
>>>>>> Hi there,
>>>>>>
>>>>>> I’m kind of new around here, but I have had experience with all
of
>>>>>> all the so called “big iron” databases such as Oracle, IBM DB2
and
>>>>>> Microsoft SQL Server as well as Postgresql.
>>>>>>
>>>>>> They all support the notion of “ANSI padding” for CHAR columns
-
>>>>>> which means that such columns are always space padded, and they default
to
>>>>>> having this enabled (for ANSI compliance).
>>>>>>
>>>>>> MySQL also supports it, but it defaults to leaving it disabled for
>>>>>> historical reasons not unlike what we have here.
>>>>>>
>>>>>> In my opinion we should push toward standards compliance where
>>>>>> possible and then document where it cannot work.
>>>>>>
>>>>>> If users don’t like the padding on CHAR columns then they should
>>>>>> change to VARCHAR - I believe that was its purpose in the first place,
and
>>>>>> it does not dictate any sort of “padding".
>>>>>>
>>>>>> I can see why you might “ban” the use of CHAR columns where they
>>>>>> cannot be consistently supported, but VARCHAR is a different animal
and I
>>>>>> would expect it to work consistently everywhere.
>>>>>>
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>> Steve C
>>>>>>
>>>>>> On 17 Mar 2020, at 10:01 am, Dongjoon Hyun <dongjoon.hyun@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>> Hi, Reynold.
>>>>>> (And +Michael Armbrust)
>>>>>>
>>>>>> If you think so, do you think it's okay that we change the return
>>>>>> value silently? Then, I'm wondering why we reverted `TRIM` functions
then?
>>>>>>
>>>>>> > Are we sure "not padding" is "incorrect"?
>>>>>>
>>>>>> Bests,
>>>>>> Dongjoon.
>>>>>>
>>>>>>
>>>>>> On Sun, Mar 15, 2020 at 11:15 PM Gourav Sengupta <
>>>>>> gourav.sengupta@gmail.com> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> 100% agree with Reynold.
>>>>>>>
>>>>>>>
>>>>>>> Regards,
>>>>>>> Gourav Sengupta
>>>>>>>
>>>>>>> On Mon, Mar 16, 2020 at 3:31 AM Reynold Xin <rxin@databricks.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Are we sure "not padding" is "incorrect"?
>>>>>>>>
>>>>>>>> I don't know whether ANSI SQL actually requires padding,
but plenty
>>>>>>>> of databases don't actually pad.
>>>>>>>>
>>>>>>>>
>>>>>>>> https://docs.snowflake.net/manuals/sql-reference/data-types-text.html
>>>>>>>> <https://aus01.safelinks.protection.outlook.com/?url=https:%2F%2Fdocs.snowflake.net%2Fmanuals%2Fsql-reference%2Fdata-types-text.html%23:~:text%3DCHAR%2520%252C%2520CHARACTER%2C(1)%2520is%2520the%2520default.%26text%3DSnowflake%2520currently%2520deviates%2520from%2520common%2Cspace-padded%2520at%2520the%2520end.&data=02%7C01%7Cscoy%40infomedia.com.au%7C5346c8d2675342008b5708d7c9fdff54%7C45d5407150f849caa59f9457123dc71c%7C0%7C0%7C637199965062044368&sdata=BvnZTTPTZBAi8oGWIvJk2fC%2FYSgdvq%2BAxtOj0nVzufk%3D&reserved=0>
:
>>>>>>>> "Snowflake currently deviates from common CHAR semantics
in that strings
>>>>>>>> shorter than the maximum length are not space-padded at the
end."
>>>>>>>>
>>>>>>>> MySQL:
>>>>>>>> https://stackoverflow.com/questions/53528645/why-char-dont-have-padding-in-mysql
>>>>>>>> <https://aus01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fstackoverflow.com%2Fquestions%2F53528645%2Fwhy-char-dont-have-padding-in-mysql&data=02%7C01%7Cscoy%40infomedia.com.au%7C5346c8d2675342008b5708d7c9fdff54%7C45d5407150f849caa59f9457123dc71c%7C0%7C0%7C637199965062044368&sdata=3OGLht%2Fa28GcKhAGwJPXIR%2BMODiIwXGVuNuResZqwXM%3D&reserved=0>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Sun, Mar 15, 2020 at 7:02 PM, Dongjoon Hyun <
>>>>>>>> dongjoon.hyun@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi, Reynold.
>>>>>>>>>
>>>>>>>>> Please see the following for the context.
>>>>>>>>>
>>>>>>>>> https://issues.apache.org/jira/browse/SPARK-31136
>>>>>>>>> <https://aus01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FSPARK-31136&data=02%7C01%7Cscoy%40infomedia.com.au%7C5346c8d2675342008b5708d7c9fdff54%7C45d5407150f849caa59f9457123dc71c%7C0%7C0%7C637199965062054364&sdata=pWQ9QhfVY4Uzyc8oIJ1QONQ0zOBAQ2DGSemyBj%2BvFeM%3D&reserved=0>
>>>>>>>>> "Revert SPARK-30098 Use default datasource as provider
for CREATE
>>>>>>>>> TABLE syntax"
>>>>>>>>>
>>>>>>>>> I raised the above issue according to the new rubric,
and the
>>>>>>>>> banning was the proposed alternative to reduce the potential
issue.
>>>>>>>>>
>>>>>>>>> Please give us your opinion since it's still PR.
>>>>>>>>>
>>>>>>>>> Bests,
>>>>>>>>> Dongjoon.
>>>>>>>>>
>>>>>>>>> On Sat, Mar 14, 2020 at 17:54 Reynold Xin <rxin@databricks.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> I don’t understand this change. Wouldn’t this
“ban” confuse the
>>>>>>>>>> hell out of both new and old users?
>>>>>>>>>>
>>>>>>>>>> For old users, their old code that was working for
char(3) would
>>>>>>>>>> now stop working.
>>>>>>>>>>
>>>>>>>>>> For new users, depending on whether the underlying
metastore
>>>>>>>>>> char(3) is either supported but different from ansi
Sql (which is not that
>>>>>>>>>> big of a deal if we explain it) or not supported.
>>>>>>>>>>
>>>>>>>>>> On Sat, Mar 14, 2020 at 3:51 PM Dongjoon Hyun <
>>>>>>>>>> dongjoon.hyun@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi, All.
>>>>>>>>>>>
>>>>>>>>>>> Apache Spark has been suffered from a known consistency
issue on
>>>>>>>>>>> `CHAR` type behavior among its usages and configurations.
However, the
>>>>>>>>>>> evolution direction has been gradually moving
forward to be consistent
>>>>>>>>>>> inside Apache Spark because we don't have `CHAR`
offically. The following
>>>>>>>>>>> is the summary.
>>>>>>>>>>>
>>>>>>>>>>> With 1.6.x ~ 2.3.x, `STORED PARQUET` has the
following different
>>>>>>>>>>> result.
>>>>>>>>>>> (`spark.sql.hive.convertMetastoreParquet=false`
provides a
>>>>>>>>>>> fallback to Hive behavior.)
>>>>>>>>>>>
>>>>>>>>>>>     spark-sql> CREATE TABLE t1(a CHAR(3));
>>>>>>>>>>>     spark-sql> CREATE TABLE t2(a CHAR(3))
STORED AS ORC;
>>>>>>>>>>>     spark-sql> CREATE TABLE t3(a CHAR(3))
STORED AS PARQUET;
>>>>>>>>>>>
>>>>>>>>>>>     spark-sql> INSERT INTO TABLE t1 SELECT
'a ';
>>>>>>>>>>>     spark-sql> INSERT INTO TABLE t2 SELECT
'a ';
>>>>>>>>>>>     spark-sql> INSERT INTO TABLE t3 SELECT
'a ';
>>>>>>>>>>>
>>>>>>>>>>>     spark-sql> SELECT a, length(a) FROM t1;
>>>>>>>>>>>     a   3
>>>>>>>>>>>     spark-sql> SELECT a, length(a) FROM t2;
>>>>>>>>>>>     a   3
>>>>>>>>>>>     spark-sql> SELECT a, length(a) FROM t3;
>>>>>>>>>>>     a 2
>>>>>>>>>>>
>>>>>>>>>>> Since 2.4.0, `STORED AS ORC` became consistent.
>>>>>>>>>>> (`spark.sql.hive.convertMetastoreOrc=false` provides
a fallback
>>>>>>>>>>> to Hive behavior.)
>>>>>>>>>>>
>>>>>>>>>>>     spark-sql> SELECT a, length(a) FROM t1;
>>>>>>>>>>>     a   3
>>>>>>>>>>>     spark-sql> SELECT a, length(a) FROM t2;
>>>>>>>>>>>     a 2
>>>>>>>>>>>     spark-sql> SELECT a, length(a) FROM t3;
>>>>>>>>>>>     a 2
>>>>>>>>>>>
>>>>>>>>>>> Since 3.0.0-preview2, `CREATE TABLE` (without
`STORED AS`
>>>>>>>>>>> clause) became consistent.
>>>>>>>>>>> (`spark.sql.legacy.createHiveTableByDefault.enabled=true`
>>>>>>>>>>> provides a fallback to Hive behavior.)
>>>>>>>>>>>
>>>>>>>>>>>     spark-sql> SELECT a, length(a) FROM t1;
>>>>>>>>>>>     a 2
>>>>>>>>>>>     spark-sql> SELECT a, length(a) FROM t2;
>>>>>>>>>>>     a 2
>>>>>>>>>>>     spark-sql> SELECT a, length(a) FROM t3;
>>>>>>>>>>>     a 2
>>>>>>>>>>>
>>>>>>>>>>> In addition, in 3.0.0, SPARK-31147 aims to ban
`CHAR/VARCHAR`
>>>>>>>>>>> type in the following syntax to be safe.
>>>>>>>>>>>
>>>>>>>>>>>     CREATE TABLE t(a CHAR(3));
>>>>>>>>>>>     https://github.com/apache/spark/pull/27902
>>>>>>>>>>> <https://aus01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fspark%2Fpull%2F27902&data=02%7C01%7Cscoy%40infomedia.com.au%7C5346c8d2675342008b5708d7c9fdff54%7C45d5407150f849caa59f9457123dc71c%7C0%7C0%7C637199965062054364&sdata=lhwUP5TcTtaO%2BLUTmx%2BPTjT0ASXPrQ7oKLL0N6EG0Ug%3D&reserved=0>
>>>>>>>>>>>
>>>>>>>>>>> This email is sent out to inform you based on
the new policy we
>>>>>>>>>>> voted.
>>>>>>>>>>> The recommendation is always using Apache Spark's
native type
>>>>>>>>>>> `String`.
>>>>>>>>>>>
>>>>>>>>>>> Bests,
>>>>>>>>>>> Dongjoon.
>>>>>>>>>>>
>>>>>>>>>>> References:
>>>>>>>>>>> 1. "CHAR implementation?", 2017/09/15
>>>>>>>>>>>
>>>>>>>>>>> https://lists.apache.org/thread.html/96b004331d9762e356053b5c8c97e953e398e489d15e1b49e775702f%40%3Cdev.spark.apache.org%3E
>>>>>>>>>>> <https://aus01.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.apache.org%2Fthread.html%2F96b004331d9762e356053b5c8c97e953e398e489d15e1b49e775702f%2540%253Cdev.spark.apache.org%253E&data=02%7C01%7Cscoy%40infomedia.com.au%7C5346c8d2675342008b5708d7c9fdff54%7C45d5407150f849caa59f9457123dc71c%7C0%7C0%7C637199965062064358&sdata=6hkno6zKTkcIrO%2FJo4hTYihsYvNynMuWcxhzL0fZR68%3D&reserved=0>
>>>>>>>>>>> 2. "FYI: SPARK-30098 Use default datasource as
provider for
>>>>>>>>>>> CREATE TABLE syntax", 2019/12/06
>>>>>>>>>>>
>>>>>>>>>>> https://lists.apache.org/thread.html/493f88c10169680191791f9f6962fd16cd0ffa3b06726e92ed04cbe1%40%3Cdev.spark.apache.org%3E
>>>>>>>>>>> <https://aus01.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.apache.org%2Fthread.html%2F493f88c10169680191791f9f6962fd16cd0ffa3b06726e92ed04cbe1%2540%253Cdev.spark.apache.org%253E&data=02%7C01%7Cscoy%40infomedia.com.au%7C5346c8d2675342008b5708d7c9fdff54%7C45d5407150f849caa59f9457123dc71c%7C0%7C0%7C637199965062064358&sdata=QJnEU3mvUJff53Gw8F%2FAbxzd%2F8ZA1hhuoQwicX4ZXyI%3D&reserved=0>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>
>>>>>> This email contains confidential information of and is the copyright
>>>>>> of Infomedia. It must not be forwarded, amended or disclosed without
>>>>>> consent of the sender. If you received this message by mistake, please
>>>>>> advise the sender and delete all copies. Security of transmission
on the
>>>>>> internet cannot be guaranteed, could be infected, intercepted, or
corrupted
>>>>>> and you should ensure you have suitable antivirus protection in place.
By
>>>>>> sending us your or any third party personal details, you consent
to (or
>>>>>> confirm you have obtained consent from such third parties) to Infomedia’s
>>>>>> privacy policy. http://www.infomedia.com.au/privacy-policy/
>>>>>>
>>>>>
>

Mime
View raw message