spark-dev mailing list archives

From Hyukjin Kwon <gurwls...@gmail.com>
Subject Re: [Discuss] Follow ANSI SQL on table insertion
Date Tue, 30 Jul 2019 06:32:49 GMT
From my look, +1 on the proposal, considering ANSI and other DBMSes in general.

On Tue, Jul 30, 2019 at 3:21 PM Wenchen Fan <cloud0fan@gmail.com> wrote:

> We can add a config for a certain behavior if it makes sense, but the most
> important thing we need to reach an agreement on here is: what should be the
> default behavior?
>
> Let's explore the solution space of table insertion behavior first:
> At compile time,
> 1. always add cast
> 2. add cast following the ANSI SQL store assignment rule (e.g. string to
> int is forbidden but long to int is allowed)
> 3. only add cast if it's 100% safe
> At runtime,
> 1. return null for invalid operations
> 2. throw exceptions at runtime for invalid operations
>
> The standards to evaluate a solution:
> 1. How robust the query execution is. For example, users usually don't
> want to see the query fail midway.
> 2. How tolerant it is to user queries. For example, a user would like to
> write long values to an int column, as they know all the long values won't
> exceed the int range.
> 3. How clean the result is. For example, users usually don't want to see
> silently corrupted data (null values).
>
> The current Spark behavior for Data Source V1 tables: always add cast and
> return null for invalid operations. This maximizes standard 1 and 2, but
> the result is least clean and users are very likely to see silently
> corrupted data (null values).
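
For illustration, a minimal SQL sketch of the V1 behavior described above; the table is hypothetical and the commented outcomes restate this thread's description rather than verified output:

    -- Data Source V1 behavior: always add a cast, return NULL for invalid values.
    CREATE TABLE t (i INT) USING parquet;             -- hypothetical target table
    INSERT INTO t SELECT 'abc';                       -- cast('abc' AS INT) cannot succeed -> NULL stored
    INSERT INTO t SELECT CAST(3000000000 AS BIGINT);  -- value exceeds the INT range -> NULL stored
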
>
> The current Spark behavior for Data Source V2 tables (new in Spark 3.0):
> only add cast if it's 100% safe. This maximizes standard 1 and 3, but many
> queries may fail to compile, even if these queries can run on other SQL
> systems. Note that, people can still see silently corrupted data because
> cast is not the only operation that can return corrupted data. Simple operations
> like ADD can also return corrupted data if overflow happens, e.g. INSERT
> INTO t1 (intCol) SELECT anotherIntCol + 100 FROM t2
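
A sketch of the overflow point above, reusing the hypothetical t1/t2 tables from the example; the comments restate the argument, not verified output:

    -- Even a "100% safe casts only" policy cannot catch corruption that happens
    -- before the write: the addition below can overflow inside the SELECT.
    INSERT INTO t1 (intCol)
    SELECT anotherIntCol + 100 FROM t2;
    -- If anotherIntCol is close to the INT maximum (2147483647), the sum wraps
    -- around to a wrong value, and that corrupted value is written without any
    -- cast being involved.
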
>
> The proposal here: add cast following ANSI SQL store assignment rule, and
> return null for invalid operations. This maximizes standard 1, and also
> fits standard 2 well: if a query can't compile in Spark, it usually can't
> compile in other mainstream databases either. I think that's tolerant
> enough. For standard 3, this proposal doesn't maximize it but can avoid
> many invalid operations already.
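
A sketch of what the proposed default would mean for a few inserts; the table is hypothetical and the commented outcomes follow the proposal's description (ANSI store assignment at analysis time, NULL for invalid values at runtime):

    CREATE TABLE t (i INT) USING parquet;              -- hypothetical target table
    INSERT INTO t SELECT 'abc';                        -- string -> int: rejected at analysis time
    INSERT INTO t SELECT CAST(42 AS BIGINT);           -- long -> int: allowed; value fits, 42 stored
    INSERT INTO t SELECT CAST(3000000000 AS BIGINT);   -- long -> int: allowed; overflows, NULL stored
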
>
> Technically we can't make the result 100% clean at compile time; we have
> to handle things like overflow at runtime. I think the new proposal makes
> more sense as the default behavior.
>
>
> On Mon, Jul 29, 2019 at 8:31 PM Russell Spitzer <russell.spitzer@gmail.com>
> wrote:
>
>> I understand Spark is making the decisions; I'm saying the actual final
>> effect of the null decision would be different depending on the insertion
>> target, if the target has different behaviors for null.
>>
>> On Mon, Jul 29, 2019 at 5:26 AM Wenchen Fan <cloud0fan@gmail.com> wrote:
>>
>>> > I'm a big -1 on null values for invalid casts.
>>>
>>> This is why we want to introduce the ANSI mode, so that an invalid cast
>>> fails at runtime. But we have to keep the null behavior for a while, to
>>> keep backward compatibility. Spark has returned null for invalid casts since
>>> the first day of Spark SQL; we can't just change it without a way to restore
>>> the old behavior.
>>>
>>> I'm OK with adding a strict mode for the upcast behavior in table
>>> insertion, but I don't agree with making it the default. The default
>>> behavior should be either the ANSI SQL behavior or the legacy Spark
>>> behavior.
>>>
>>> > other modes should be allowed only with a strict warning that the behavior
>>> will be determined by the underlying sink.
>>>
>>> Seems there is some misunderstanding. The table insertion behavior is
>>> fully controlled by Spark. Spark decides when to add a cast, and Spark decides
>>> whether an invalid cast should return null or fail. The sink is only
>>> responsible for writing data, not the type coercion/cast stuff.
>>>
>>> On Sun, Jul 28, 2019 at 12:24 AM Russell Spitzer <
>>> russell.spitzer@gmail.com> wrote:
>>>
>>>> I'm a big -1 on null values for invalid casts. This can lead to a lot
>>>> of even more unexpected errors and runtime behavior, since null is
>>>>
>>>> 1. not allowed in all schemas (leading to a runtime error anyway)
>>>> 2. the same as delete in some systems (leading to data loss)
>>>>
>>>> And this would be dependent on the sink being used. Spark won't just be
>>>> interacting with ANSI-compliant sinks, so I think it makes much more sense
>>>> to be strict. I think Upcast mode is a sensible default, and other modes
>>>> should be allowed only with a strict warning that the behavior will be
>>>> determined by the underlying sink.
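
A sketch of why the effect of a NULL produced by a failed cast depends on the sink; the table is hypothetical and the two outcomes restate the points above:

    -- Assume events(id INT) lives in a sink whose schema forbids NULL for id.
    INSERT INTO events SELECT CAST(3000000000 AS BIGINT);
    -- If the overflowed cast produces NULL:
    --   * a sink that rejects NULL in this column fails the write at runtime anyway;
    --   * a sink where writing NULL means delete silently loses the row.
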
>>>>
>>>> On Sat, Jul 27, 2019 at 8:05 AM Takeshi Yamamuro <linguin.m.s@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi, all
>>>>>
>>>>> +1 for implementing this new store cast mode.
>>>>> From a viewpoint of DBMS users, this cast is pretty common for INSERTs,
>>>>> and I think this functionality could promote migrations from existing
>>>>> DBMSs to Spark.
>>>>>
>>>>> The most important thing for DBMS users is that they could optionally
>>>>> choose this mode when inserting data.
>>>>> Therefore, I think it might be okay that the two modes (the current
>>>>> upcast mode and the proposed store cast mode) co-exist for INSERTs.
>>>>> (There is room to discuss which mode is enabled by default though...)
>>>>>
>>>>> IMHO we'll provide three behaviours below for INSERTs (see the sketch after the list):
>>>>>  - upcast mode
>>>>>  - ANSI store cast mode and runtime exceptions thrown for invalid
>>>>> values
>>>>>  - ANSI store cast mode and null filled for invalid values
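
A sketch contrasting the three behaviours above on one statement, with a hypothetical table t(i INT); the commented outcomes restate the modes as described in this thread, not verified Spark output:

    INSERT INTO t SELECT CAST(3000000000 AS BIGINT);
    -- upcast mode:                      rejected at analysis time (long -> int is not an upcast)
    -- ANSI store cast + runtime error:  analysis passes; the overflow fails the query at runtime
    -- ANSI store cast + null fill:      analysis passes; the overflowed value is stored as NULL
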
>>>>>
>>>>>
>>>>> On Sat, Jul 27, 2019 at 8:03 PM Gengliang Wang <
>>>>> gengliang.wang@databricks.com> wrote:
>>>>>
>>>>>> Hi Ryan,
>>>>>>
>>>>>> Thanks for the suggestions on the proposal and doc.
>>>>>> Currently, there is no data type validation in the table insertion of V1.
>>>>>> We are on the same page that we should improve it. But using UpCast goes
>>>>>> from one extreme to another. It is possible that many queries are broken
>>>>>> after upgrading to Spark 3.0.
>>>>>> The rules of UpCast are too strict. E.g. it doesn't allow assigning
>>>>>> Timestamp type to Date type, as there would be "precision loss". To me,
>>>>>> the type coercion is reasonable and the "precision loss" is expected.
>>>>>> This is very common in other SQL engines.
>>>>>> As long as Spark is following the ANSI SQL store assignment rules, it is
>>>>>> users' responsibility to take good care of the type coercion in data
>>>>>> writing. I think it's the right decision.
>>>>>>
>>>>>> > But the new behavior is only applied in DataSourceV2, so it won’t
>>>>>> affect existing jobs until sources move to v2 and break other behavior
>>>>>> anyway.
>>>>>> Eventually, most sources are supposed to be migrated to Data Source V2.
>>>>>> I think we can discuss and make a decision now.
>>>>>>
>>>>>> > Fixing the silent corruption by adding a runtime exception is not a
>>>>>> good option, either.
>>>>>> The new optional mode proposed in
>>>>>> https://issues.apache.org/jira/browse/SPARK-28512 is disabled by
>>>>>> default. This should be fine.
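
A sketch of the Timestamp-to-Date case mentioned above; the table and column names are hypothetical and the outcomes restate the argument in this thread:

    -- Assume logs(event_date DATE) and a source raw_events(ts TIMESTAMP).
    INSERT INTO logs SELECT ts FROM raw_events;
    -- UpCast-only validation rejects this (timestamp -> date drops the time part);
    -- under the ANSI store assignment rules discussed here, the cast would be
    -- added and the insert would succeed.
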
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Sat, Jul 27, 2019 at 10:23 AM Wenchen Fan <cloud0fan@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> I don't agree with handling literal values specially. Although
>>>>>>> Postgres does it, I can't find anything about it in the SQL standard. And
>>>>>>> it introduces inconsistent behaviors which may be strange to users:
>>>>>>> * What about something like "INSERT INTO t SELECT float_col + 1.1"?
>>>>>>> * The same insert with a decimal column as input will fail even when
>>>>>>> a decimal literal would succeed.
>>>>>>> * Similar insert queries with "literal" inputs can be constructed
>>>>>>> through layers of indirection via views, inline views, CTEs, unions, etc.
>>>>>>> Would those decimals be treated as columns and fail, or would we attempt
>>>>>>> to make them succeed as well? Would users find this behavior surprising?
>>>>>>>
>>>>>>> Silently corrupt data is bad, but this is the decision we made at
>>>>>>> the beginning when designing Spark's behavior. Whenever an error occurs,
>>>>>>> Spark attempts to return null instead of throwing a runtime exception.
>>>>>>> Recently we provided configs to make Spark fail at runtime for overflow,
>>>>>>> but that's another story. Silently corrupt data is bad, runtime exceptions
>>>>>>> are bad, and forbidding all the table insertions that may fail (even with
>>>>>>> very little possibility) is also bad. We have to make trade-offs. The
>>>>>>> trade-offs we made in this proposal are:
>>>>>>> * forbid table insertions that are very likely to fail, at compile
>>>>>>> time (things like writing string values to an int column).
>>>>>>> * allow table insertions that are not that likely to fail. If the
>>>>>>> data is wrong, don't fail; insert null.
>>>>>>> * provide a config to fail the insertion at runtime if the data is
>>>>>>> wrong.
>>>>>>>
>>>>>>> > But the new behavior is only applied in DataSourceV2, so it won’t
>>>>>>> affect existing jobs until sources move to v2 and break other behavior
>>>>>>> anyway.
>>>>>>> When users write SQL queries, they don't care if a table is backed
>>>>>>> by Data Source V1 or V2. We should make sure the table insertion behavior
>>>>>>> is consistent and reasonable. Furthermore, users may not even care if the
>>>>>>> SQL queries are run in Spark or another RDBMS; it's better to follow the
>>>>>>> SQL standard instead of introducing a Spark-specific behavior.
>>>>>>>
>>>>>>> We are not talking about a small use case like allowing writing a
>>>>>>> decimal literal to a float column; we are talking about a big goal to make
>>>>>>> Spark compliant with the SQL standard, w.r.t.
>>>>>>> https://issues.apache.org/jira/browse/SPARK-26217 . This proposal
>>>>>>> is a sub-task of it, to make the table insertion behavior follow the SQL
>>>>>>> standard.
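
A sketch of the literal-special-casing questions above, with a hypothetical t(f FLOAT) and source table s; the comments restate the open questions rather than settled behavior:

    INSERT INTO t VALUES (1.1);                     -- bare literal, parsed as decimal: special-cased to float?
    INSERT INTO t SELECT float_col + 1.1 FROM s;    -- literal folded into an expression: still special-cased?
    INSERT INTO t SELECT decimal_col FROM s;        -- same value coming from a decimal column: rejected?
    -- Treating the first statement differently from the others is the
    -- inconsistency being questioned.
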
>>>>>>>
>>>>>>> On Sat, Jul 27, 2019 at 1:35 AM Ryan Blue <rblue@netflix.com>
wrote:
>>>>>>>
>>>>>>>> I don’t think this is a good idea. Following the ANSI standard is
>>>>>>>> usually fine, but here it would *silently corrupt data*.
>>>>>>>>
>>>>>>>> From your proposal doc, ANSI allows implicitly casting from long
>>>>>>>> to int (any numeric type to any other numeric type) and inserts
>>>>>>>> NULL when a value overflows. That would drop data values and is
>>>>>>>> not safe.
>>>>>>>>
>>>>>>>> Fixing the silent corruption by adding a runtime exception is not
>>>>>>>> a good option, either. That puts off the problem until much of the
>>>>>>>> job has completed, instead of catching the error at analysis time.
>>>>>>>> It is better to catch this earlier during analysis than to run most
>>>>>>>> of a job and then fail.
>>>>>>>>
>>>>>>>> In addition, part of the justification for using the ANSI standard
>>>>>>>> is to avoid breaking existing jobs. But the new behavior is only
>>>>>>>> applied in DataSourceV2, so it won’t affect existing jobs until
>>>>>>>> sources move to v2 and break other behavior anyway.
>>>>>>>>
>>>>>>>> I think that the correct solution is to go with the existing
>>>>>>>> validation rules that require explicit casts to truncate values.
>>>>>>>>
>>>>>>>> That still leaves the use case that motivated this proposal, which
>>>>>>>> is that floating point literals are parsed as decimals and fail
>>>>>>>> simple insert statements. We already came up with two alternatives
>>>>>>>> to fix that problem in the DSv2 sync, and I think it is a better
>>>>>>>> idea to go with one of those instead of “fixing” Spark in a way
>>>>>>>> that will corrupt data or cause runtime failures.
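
A sketch of the motivating case mentioned above, with a hypothetical t(f FLOAT); the comment restates the problem as described in this thread:

    INSERT INTO t VALUES (1.1);
    -- The literal 1.1 is parsed as a DECIMAL, and decimal -> float is not an
    -- upcast, so the current DSv2 (UpCast-only) rule rejects even this simple
    -- insert at analysis time.
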
>>>>>>>>
>>>>>>>> On Thu, Jul 25, 2019 at 9:11 AM Wenchen Fan <cloud0fan@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> I have heard many complaints about the old table insertion
>>>>>>>>> behavior. Blindly casting everything will leak user mistakes to a
>>>>>>>>> late stage of the data pipeline and make them very hard to debug.
>>>>>>>>> When a user writes string values to an int column, it's probably a
>>>>>>>>> mistake, e.g. the columns are misordered in the INSERT statement.
>>>>>>>>> We should fail the query earlier and ask users to fix the mistake.
>>>>>>>>>
>>>>>>>>> In the meanwhile, I agree that the new table insertion behavior we
>>>>>>>>> introduced for Data Source V2 is too strict. It may fail valid
>>>>>>>>> queries unexpectedly.
>>>>>>>>>
>>>>>>>>> In general, I support the direction of following the ANSI SQL
>>>>>>>>> standard. But I'd like to do it in 2 steps:
>>>>>>>>> 1. only add a cast when the assignment rule is satisfied. This
>>>>>>>>> should be the default behavior, and we should provide a legacy
>>>>>>>>> config to restore the old behavior.
>>>>>>>>> 2. fail the cast operation at runtime if overflow happens. AFAIK
>>>>>>>>> Marco Gaido is working on it already. This will have a config as
>>>>>>>>> well, and by default we still return null.
>>>>>>>>>
>>>>>>>>> After doing this, the default behavior will be slightly different
>>>>>>>>> from the SQL standard (cast can return null), and users can turn on
>>>>>>>>> the ANSI mode to fully follow the SQL standard. This is much better
>>>>>>>>> than before and should prevent a lot of user mistakes. It's also a
>>>>>>>>> reasonable choice to me to not throw exceptions at runtime by
>>>>>>>>> default, as that's usually bad for long-running jobs.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Wenchen
>>>>>>>>>
>>>>>>>>> On Thu, Jul 25, 2019 at 11:37 PM Gengliang Wang <
>>>>>>>>> gengliang.wang@databricks.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi everyone,
>>>>>>>>>>
>>>>>>>>>> I would like to discuss the table insertion behavior of Spark.
>>>>>>>>>> In the current data source V2, only UpCast is allowed for table
>>>>>>>>>> insertion. I think following ANSI SQL is a better idea.
>>>>>>>>>> For more information, please read the doc "Discuss: Follow ANSI
>>>>>>>>>> SQL on table insertion"
>>>>>>>>>> <https://docs.google.com/document/d/1b9nnWWbKVDRp7lpzhQS1buv1_lDzWIZY2ApFs5rBcGI/edit?usp=sharing>
>>>>>>>>>> Please let me know if you have any thoughts on this.
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Gengliang
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Ryan Blue
>>>>>>>> Software Engineer
>>>>>>>> Netflix
>>>>>>>>
>>>>>>>
>>>>>
>>>>> --
>>>>> ---
>>>>> Takeshi Yamamuro
>>>>>
>>>>
