spark-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Alessandro Solimando <alessandro.solima...@gmail.com>
Subject Re: [VOTE][SPARK-28885] Follow ANSI store assignment rules in table insertion by default
Date Mon, 07 Oct 2019 10:05:44 GMT
+1 (non-binding)

I have been following this standardization effort and I think it is sound
and it provides the needed flexibility via the option.

Best regards,
Alessandro

On Mon, 7 Oct 2019 at 10:24, Gengliang Wang <gengliang.wang@databricks.com>
wrote:

> Hi everyone,
>
> I'd like to call for a new vote on SPARK-28885
> <https://issues.apache.org/jira/browse/SPARK-28885> "Follow ANSI store
> assignment rules in table insertion by default" after revising the ANSI
> store assignment policy(SPARK-29326
> <https://issues.apache.org/jira/browse/SPARK-29326>).
> When inserting a value into a column with the different data type, Spark
> performs type coercion. Currently, we support 3 policies for the store
> assignment rules: ANSI, legacy and strict, which can be set via the option
> "spark.sql.storeAssignmentPolicy":
> 1. ANSI: Spark performs the store assignment as per ANSI SQL. In practice,
> the behavior is mostly the same as PostgreSQL. It disallows certain
> unreasonable type conversions such as converting `string` to `int` and
> `double` to `boolean`. It will throw a runtime exception if the value is
> out-of-range(overflow).
> 2. Legacy: Spark allows the store assignment as long as it is a valid
> `Cast`, which is very loose. E.g., converting either `string` to `int` or
> `double` to `boolean` is allowed. It is the current behavior in Spark 2.x
> for compatibility with Hive. When inserting an out-of-range value to an
> integral field, the low-order bits of the value is inserted(the same as
> Java/Scala numeric type casting). For example, if 257 is inserted into a
> field of Byte type, the result is 1.
> 3. Strict: Spark doesn't allow any possible precision loss or data
> truncation in store assignment, e.g., converting either `double` to `int`
> or `decimal` to `double` is allowed. The rules are originally for Dataset
> encoder. As far as I know, no mainstream DBMS is using this policy by
> default.
>
> Currently, the V1 data source uses "Legacy" policy by default, while V2
> uses "Strict". This proposal is to use "ANSI" policy by default for both V1
> and V2 in Spark 3.0.
>
> This vote is open until Friday (Oct. 11).
>
> [ ] +1: Accept the proposal
> [ ] +0
> [ ] -1: I don't think this is a good idea because ...
>
> Thank you!
>
> Gengliang
>

Mime
View raw message