spark-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Xiao Li <lix...@databricks.com>
Subject Re: [VOTE][SPARK-28885] Follow ANSI store assignment rules in table insertion by default
Date Thu, 10 Oct 2019 13:30:26 GMT
+1

On Thu, Oct 10, 2019 at 2:13 AM Hyukjin Kwon <gurwls223@gmail.com> wrote:

> +1 (binding)
>
> 2019년 10월 10일 (목) 오후 5:11, Takeshi Yamamuro <linguin.m.s@gmail.com>님이
작성:
>
>> Thanks for the great work, Gengliang!
>>
>> +1 for that.
>> As I said before, the behaviour is pretty common in DBMSs, so the change
>> helps for DMBS users.
>>
>> Bests,
>> Takeshi
>>
>>
>> On Mon, Oct 7, 2019 at 5:24 PM Gengliang Wang <
>> gengliang.wang@databricks.com> wrote:
>>
>>> Hi everyone,
>>>
>>> I'd like to call for a new vote on SPARK-28885
>>> <https://issues.apache.org/jira/browse/SPARK-28885> "Follow ANSI store
>>> assignment rules in table insertion by default" after revising the ANSI
>>> store assignment policy(SPARK-29326
>>> <https://issues.apache.org/jira/browse/SPARK-29326>).
>>> When inserting a value into a column with the different data type, Spark
>>> performs type coercion. Currently, we support 3 policies for the store
>>> assignment rules: ANSI, legacy and strict, which can be set via the option
>>> "spark.sql.storeAssignmentPolicy":
>>> 1. ANSI: Spark performs the store assignment as per ANSI SQL. In
>>> practice, the behavior is mostly the same as PostgreSQL. It disallows
>>> certain unreasonable type conversions such as converting `string` to `int`
>>> and `double` to `boolean`. It will throw a runtime exception if the value
>>> is out-of-range(overflow).
>>> 2. Legacy: Spark allows the store assignment as long as it is a valid
>>> `Cast`, which is very loose. E.g., converting either `string` to `int` or
>>> `double` to `boolean` is allowed. It is the current behavior in Spark 2.x
>>> for compatibility with Hive. When inserting an out-of-range value to an
>>> integral field, the low-order bits of the value is inserted(the same as
>>> Java/Scala numeric type casting). For example, if 257 is inserted into a
>>> field of Byte type, the result is 1.
>>> 3. Strict: Spark doesn't allow any possible precision loss or data
>>> truncation in store assignment, e.g., converting either `double` to `int`
>>> or `decimal` to `double` is allowed. The rules are originally for Dataset
>>> encoder. As far as I know, no mainstream DBMS is using this policy by
>>> default.
>>>
>>> Currently, the V1 data source uses "Legacy" policy by default, while V2
>>> uses "Strict". This proposal is to use "ANSI" policy by default for both V1
>>> and V2 in Spark 3.0.
>>>
>>> This vote is open until Friday (Oct. 11).
>>>
>>> [ ] +1: Accept the proposal
>>> [ ] +0
>>> [ ] -1: I don't think this is a good idea because ...
>>>
>>> Thank you!
>>>
>>> Gengliang
>>>
>>
>>
>> --
>> ---
>> Takeshi Yamamuro
>>
> --
[image: Databricks Summit - Watch the talks]
<https://databricks.com/sparkaisummit/north-america>

Mime
View raw message