spark-dev mailing list archives

From Ryan Blue <rb...@netflix.com.INVALID>
Subject Re: [VOTE][SPARK-28885] Follow ANSI store assignment rules in table insertion by default
Date Sat, 07 Sep 2019 00:27:58 GMT
We discussed this thread quite a bit in the DSv2 sync up and Russell
brought up a really good point about this.

The ANSI rule used here specifies how to store a specific value, V, so this
is a runtime rule — an earlier case covers when V is NULL, so it is
definitely referring to a specific value. The rule requires that if the
type doesn’t match or if the value cannot be truncated, an exception is
thrown for “numeric value out of range”.

That runtime error guarantees that even though the cast is introduced at
analysis time, unexpected NULL values aren’t inserted into a table in place
of data values that are out of range. Unexpected NULL values are the
problem that was concerning to many of us in the discussion thread, but it
turns out that real ANSI behavior doesn’t have the problem. (In the sync,
we validated this by checking Postgres and MySQL behavior, too.)
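
To make the difference concrete, here is a minimal Python sketch of the
runtime rule as described above. It is not Spark code: the names
`store_int`, `fail_on_overflow`, and `NumericValueOutOfRange` are
hypothetical, the target is assumed to be a 32-bit integer column, and
inputs are assumed to be finite numbers or NULL (None).

```python
INT_MIN, INT_MAX = -2**31, 2**31 - 1  # assumed 32-bit int column


class NumericValueOutOfRange(ArithmeticError):
    """Stands in for the "numeric value out of range" exception."""


def store_int(value, fail_on_overflow=True):
    """Store one value V into the int column (hypothetical helper)."""
    if value is None:
        # The NULL case is handled before the value rule applies.
        return None
    truncated = int(value)  # drop the fractional part, e.g. 3.9 -> 3
    if INT_MIN <= truncated <= INT_MAX:
        return truncated
    if fail_on_overflow:
        # Runtime check on: fail instead of storing something unexpected.
        raise NumericValueOutOfRange(f"numeric value out of range: {value!r}")
    # Runtime check off ("return null on error"): a NULL silently replaces
    # the out-of-range data value.
    return None
```

With the check on, `store_int(2**31)` raises; with it off, the same call
returns None, which is exactly the unexpected NULL that people were worried
about in the discussion thread.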

In Spark, the runtime check is a separate configuration property from this
one, but in order to actually implement ANSI semantics, both need to be
set. So I think it makes sense to *change both defaults to be ANSI*. The
analysis check alone does not implement the ANSI standard.

In the sync, we also agreed that it makes sense to be able to turn off the
runtime check in order to avoid job failures. Another, safer way to avoid
job failures is to require an explicit cast, i.e., strict mode.
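
As a rough illustration of the analysis-time side, the sketch below encodes
the type pairs mentioned in this thread as a lookup table. It is a toy model
and not Spark's implementation: `ALLOWED` and `analysis_check` are
hypothetical names, and the entries marked "assumed" are my reading of the
policy descriptions rather than something stated in the thread.

```python
# (source type, target type) -> does each policy accept the insert
# without an explicit cast?
ALLOWED = {
    ("string", "int"):     {"legacy": True, "ansi": False, "strict": False},  # strict: assumed
    ("double", "boolean"): {"legacy": True, "ansi": False, "strict": False},  # strict: assumed
    ("timestamp", "date"): {"legacy": True, "ansi": True,  "strict": False},  # legacy: assumed
    ("double", "int"):     {"legacy": True, "ansi": True,  "strict": False},  # legacy, ansi: assumed
    ("decimal", "double"): {"legacy": True, "ansi": True,  "strict": False},  # legacy, ansi: assumed
}


def analysis_check(policy, source_type, target_type, explicit_cast=False):
    """Return True if the insert passes the analysis-time check."""
    if explicit_cast or source_type == target_type:
        # With an explicit cast the query already produces the target type.
        return True
    return ALLOWED[(source_type, target_type)][policy]
```

In this model, `analysis_check("strict", "double", "int")` is False until the
user adds an explicit cast, while ANSI accepts it and relies on the runtime
check to catch out-of-range values.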

I think that we should amend this proposal to change the default for both
the runtime check and the analysis check to ANSI.

As this stands now, I vote -1. But I would support this if the vote were to
set both runtime and analysis checks to ANSI mode.

rb

On Fri, Sep 6, 2019 at 3:12 AM Alastair Green
<alastair.green@neo4j.com.invalid> wrote:

> Makes sense.
>
> While the ISO SQL standard automatically becomes an American national
> (ANSI) standard, changes are only made to the International (ISO/IEC)
> Standard, which is the authoritative specification.
>
> These rules are specified in SQL/Foundation (ISO/IEC SQL Part 2), section
> 9.2.
>
> Could we rename the proposed default to “ISO/IEC (ANSI)”?
>
> — Alastair
>
> On Thu, Sep 5, 2019 at 17:17, Reynold Xin <rxin@databricks.com> wrote:
>
> Having three modes is a lot. Why not just use ansi mode as default, and
> legacy for backward compatibility? Then over time there's only the ANSI
> mode, which is standard compliant and easy to understand. We also don't
> need to invent a standard just for Spark.
>
>
> On Thu, Sep 05, 2019 at 12:27 AM, Wenchen Fan <cloud0fan@gmail.com> wrote:
>
>> +1
>>
>> To be honest I don't like the legacy policy. It's too loose and makes it
>> easy for users to make mistakes, especially when Spark returns null if a
>> function hits errors like overflow.
>>
>> The strict policy is not good either. It's too strict and stops valid use
>> cases like writing timestamp values to a date type column. Users do expect
>> truncation to happen without adding a cast manually in this case. It's also
>> weird to use a Spark-specific policy that no other database is using.
>>
>> The ANSI policy is better. It stops invalid use cases like writing string
>> values to an int type column, while keeping valid use cases like timestamp
>> -> date.
>>
>> I think there is no doubt that we should use ANSI policy instead of legacy
>> policy for v1 tables. Except for backward compatibility, ANSI policy is
>> literally better than the legacy policy.
>>
>> The v2 table is arguable here. Although the ANSI policy is better than
>> strict policy to me, this is just the store assignment policy, which only
>> partially controls the table insertion behavior. With Spark's "return null
>> on error" behavior, the table insertion is more likely to insert invalid
>> null values with the ANSI policy compared to the strict policy.
>>
>> I think we should use ANSI policy by default for both v1 and v2 tables,
>> because
>> 1. End-users don't care how the table is implemented. Spark should
>> provide consistent table insertion behavior between v1 and v2 tables.
>> 2. Data Source V2 is unstable in Spark 2.x so there is no backward
>> compatibility issue. That said, the baseline to judge which policy is
>> better should be the table insertion behavior in Spark 2.x, which is the
>> legacy policy + "return null on error". ANSI policy is better than the
>> baseline.
>> 3. We expect more and more users to migrate their data sources to the V2
>> API. The strict policy can be a stopper as it's too big a breaking change,
>> which may break many existing queries.
>>
>> Thanks,
>> Wenchen
>>
>>
>> On Wed, Sep 4, 2019 at 1:59 PM Gengliang Wang <
>> gengliang.wang@databricks.com> wrote:
>>
>> Hi everyone,
>>
>> I'd like to call for a vote on SPARK-28885 <https://issues.apache.org/jira/browse/SPARK-28885>
>> "Follow ANSI store assignment rules in table insertion by default".
>> When inserting a value into a column with a different data type, Spark
>> performs type coercion. Currently, we support 3 policies for the type
>> coercion rules: ANSI, legacy and strict, which can be set via the option
>> "spark.sql.storeAssignmentPolicy":
>> 1. ANSI: Spark performs the type coercion as per ANSI SQL. In practice, the
>> behavior is mostly the same as PostgreSQL. It disallows certain unreasonable
>> type conversions such as converting `string` to `int` and `double` to
>> `boolean`.
>> 2. Legacy: Spark allows the type coercion as long as it is a valid `Cast`,
>> which is very loose. E.g., converting either `string` to `int` or `double`
>> to `boolean` is allowed. It is the current behavior in Spark 2.x for
>> compatibility with Hive.
>> 3. Strict: Spark doesn't allow any possible precision loss or data
>> truncation in type coercion, e.g., converting either `double` to `int` or
>> `decimal` to `double` is not allowed. The rules were originally for the
>> Dataset encoder. As far as I know, no mainstream DBMS is using this policy
>> by default.
>>
>> Currently, the V1 data source uses "Legacy" policy by default, while V2
>> uses "Strict". This proposal is to use "ANSI" policy by default for both V1
>> and V2 in Spark 3.0.
>>
>> There was also a DISCUSS thread "Follow ANSI SQL on table insertion" in the
>> dev mailing list.
>>
>> This vote is open until next Thursday (Sept. 12th).
>>
>> [ ] +1: Accept the proposal
>> [ ] +0
>> [ ] -1: I don't think this is a good idea because ...
>>
>> Thank you!
>>
>> Gengliang
>>
>>
>

-- 
Ryan Blue
Software Engineer
Netflix
