spark-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Takeshi Yamamuro <linguin....@gmail.com>
Subject Re: [Discuss] Follow ANSI SQL on table insertion
Date Sat, 27 Jul 2019 13:04:56 GMT
Hi, all

+1 for implementing this new store cast mode.
>From a viewpoint of DBMS users, this cast is pretty common for INSERTs and
I think this functionality could
promote migrations from existing DBMSs to Spark.

The most important thing for DBMS users is that they could optionally
choose this mode when inserting data.
Therefore, I think it might be okay that the two modes (the current upcast
mode and the proposed store cast mode)
co-exist for INSERTs. (There is a room to discuss which mode  is enabled by
default though...)

IMHO we'll provide three behaviours below for INSERTs;
 - upcast mode
 - ANSI store cast mode and runtime exceptions thrown for invalid values
 - ANSI store cast mode and null filled for invalid values


On Sat, Jul 27, 2019 at 8:03 PM Gengliang Wang <
gengliang.wang@databricks.com> wrote:

> Hi Ryan,
>
> Thanks for the suggestions on the proposal and doc.
> Currently, there is no data type validation in table insertion of V1. We
> are on the same page that we should improve it. But using UpCast is from
> one extreme to another. It is possible that many queries are broken after
> upgrading to Spark 3.0.
> The rules of UpCast are too strict. E.g. it doesn't allow assigning
> Timestamp type to Date Type, as there will be "precision loss". To me, the
> type coercion is reasonable and the "precision loss" is under expectation.
> This is very common in other SQL engines.
> As long as Spark is following the ANSI SQL store assignment rules, it is
> users' responsibility to take good care of the type coercion in data
> writing. I think it's the right decision.
>
> > But the new behavior is only applied in DataSourceV2, so it won’t affect
> existing jobs until sources move to v2 and break other behavior anyway.
> Eventually, most sources are supposed to be migrated to DataSourceV2 V2. I
> think we can discuss and make a decision now.
>
> > Fixing the silent corruption by adding a runtime exception is not a good
> option, either.
> The new optional mode proposed in
> https://issues.apache.org/jira/browse/SPARK-28512 is disabled by default.
> This should be fine.
>
>
>
> On Sat, Jul 27, 2019 at 10:23 AM Wenchen Fan <cloud0fan@gmail.com> wrote:
>
>> I don't agree with handling literal values specially. Although Postgres
>> does it, I can't find anything about it in the SQL standard. And it
>> introduces inconsistent behaviors which may be strange to users:
>> * What about something like "INSERT INTO t SELECT float_col + 1.1"?
>> * The same insert with a decimal column as input will fail even when a
>> decimal literal would succeed
>> * Similar insert queries with "literal" inputs can be constructed through
>> layers of indirection via views, inline views, CTEs, unions, etc. Would
>> those decimals be treated as columns and fail or would we attempt to make
>> them succeed as well? Would users find this behavior surprising?
>>
>> Silently corrupt data is bad, but this is the decision we made at the
>> beginning when design Spark behaviors. Whenever an error occurs, Spark
>> attempts to return null instead of runtime exception. Recently we provide
>> configs to make Spark fail at runtime for overflow, but that's another
>> story. Silently corrupt data is bad, runtime exception is bad, and
>> forbidding all the table insertions that may fail(even with very little
>> possibility) is also bad. We have to make trade-offs. The trade-offs we
>> made in this proposal are:
>> * forbid table insertions that are very like to fail, at compile time.
>> (things like writing string values to int column)
>> * allow table insertions that are not that likely to fail. If the data is
>> wrong, don't fail, insert null.
>> * provide a config to fail the insertion at runtime if the data is wrong.
>>
>> >  But the new behavior is only applied in DataSourceV2, so it won’t
>> affect existing jobs until sources move to v2 and break other behavior
>> anyway.
>> When users write SQL queries, they don't care if a table is backed by
>> Data Source V1 or V2. We should make sure the table insertion behavior is
>> consistent and reasonable. Furthermore, users may even not care if the SQL
>> queries are run in Spark or other RDBMS, it's better to follow SQL standard
>> instead of introducing a Spark-specific behavior.
>>
>> We are not talking about a small use case like allowing writing decimal
>> literal to float column, we are talking about a big goal to make Spark
>> compliant to SQL standard, w.r.t.
>> https://issues.apache.org/jira/browse/SPARK-26217 . This proposal is a
>> sub-task of it, to make the table insertion behavior follow SQL standard.
>>
>> On Sat, Jul 27, 2019 at 1:35 AM Ryan Blue <rblue@netflix.com> wrote:
>>
>>> I don’t think this is a good idea. Following the ANSI standard is
>>> usually fine, but here it would *silently corrupt data*.
>>>
>>> From your proposal doc, ANSI allows implicitly casting from long to int
>>> (any numeric type to any other numeric type) and inserts NULL when a value
>>> overflows. That would drop data values and is not safe.
>>>
>>> Fixing the silent corruption by adding a runtime exception is not a good
>>> option, either. That puts off the problem until much of the job has
>>> completed, instead of catching the error at analysis time. It is better to
>>> catch this earlier during analysis than to run most of a job and then fail.
>>>
>>> In addition, part of the justification for using the ANSI standard is to
>>> avoid breaking existing jobs. But the new behavior is only applied in
>>> DataSourceV2, so it won’t affect existing jobs until sources move to v2 and
>>> break other behavior anyway.
>>>
>>> I think that the correct solution is to go with the existing validation
>>> rules that require explicit casts to truncate values.
>>>
>>> That still leaves the use case that motivated this proposal, which is
>>> that floating point literals are parsed as decimals and fail simple insert
>>> statements. We already came up with two alternatives to fix that problem in
>>> the DSv2 sync and I think it is a better idea to go with one of those
>>> instead of “fixing” Spark in a way that will corrupt data or cause runtime
>>> failures.
>>>
>>> On Thu, Jul 25, 2019 at 9:11 AM Wenchen Fan <cloud0fan@gmail.com> wrote:
>>>
>>>> I have heard about many complaints about the old table insertion
>>>> behavior. Blindly casting everything will leak the user mistake to a late
>>>> stage of the data pipeline, and make it very hard to debug. When a user
>>>> writes string values to an int column, it's probably a mistake and the
>>>> columns are misordered in the INSERT statement. We should fail the query
>>>> earlier and ask users to fix the mistake.
>>>>
>>>> In the meanwhile, I agree that the new table insertion behavior we
>>>> introduced for Data Source V2 is too strict. It may fail valid queries
>>>> unexpectedly.
>>>>
>>>> In general, I support the direction of following the ANSI SQL standard.
>>>> But I'd like to do it with 2 steps:
>>>> 1. only add cast when the assignment rule is satisfied. This should be
>>>> the default behavior and we should provide a legacy config to restore to
>>>> the old behavior.
>>>> 2. fail the cast operation at runtime if overflow happens. AFAIK Marco
>>>> Gaido is working on it already. This will have a config as well and by
>>>> default we still return null.
>>>>
>>>> After doing this, the default behavior will be slightly different from
>>>> the SQL standard (cast can return null), and users can turn on the ANSI
>>>> mode to fully follow the SQL standard. This is much better than before and
>>>> should prevent a lot of user mistakes. It's also a reasonable choice to me
>>>> to not throw exceptions at runtime by default, as it's usually bad for
>>>> long-running jobs.
>>>>
>>>> Thanks,
>>>> Wenchen
>>>>
>>>> On Thu, Jul 25, 2019 at 11:37 PM Gengliang Wang <
>>>> gengliang.wang@databricks.com> wrote:
>>>>
>>>>> Hi everyone,
>>>>>
>>>>> I would like to discuss the table insertion behavior of Spark. In the
>>>>> current data source V2, only UpCast is allowed for table insertion. I
think
>>>>> following ANSI SQL is a better idea.
>>>>> For more information, please read the Discuss: Follow ANSI SQL on
>>>>> table insertion
>>>>> <https://docs.google.com/document/d/1b9nnWWbKVDRp7lpzhQS1buv1_lDzWIZY2ApFs5rBcGI/edit?usp=sharing>
>>>>> Please let me know if you have any thoughts on this.
>>>>>
>>>>> Regards,
>>>>> Gengliang
>>>>>
>>>>
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>>>
>>

-- 
---
Takeshi Yamamuro

Mime
View raw message