spark-dev mailing list archives

From "Reynold Xin" <r...@databricks.com>
Subject Re: [Discuss] Follow ANSI SQL on table insertion
Date Wed, 31 Jul 2019 17:39:41 GMT
OK to push back: "disagreeing with the premise that we can afford to not be maximal on standard
3. The correctness of the data is non-negotiable, and whatever solution we settle on cannot
silently adjust the user’s data under any circumstances."

This blanket statement sounds great on the surface, but there are a lot of subtleties. "Correctness"
is absolutely important, but engineering/prod development is often about tradeoffs, and the
industry has consistently traded correctness for performance or convenience, e.g. overflow
checks, null pointers, consistency in databases ...

It all depends on the use cases and what degree of error they can tolerate. For example, while
I want my data engineering production pipeline to throw an error when the data doesn't match
my expectations (e.g. type widening, overflow), if I'm doing some quick and dirty data science,
I don't want the job to just fail due to outliers.
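
To make the contrast concrete, a minimal sketch in Spark SQL (the flag name below is only
illustrative of the ANSI mode being discussed, not an existing setting, and the tables/columns
are made up):

    -- Production pipeline: surface bad data immediately.
    SET spark.sql.ansi.enabled=true;                 -- illustrative name for the strict/ANSI mode
    INSERT INTO t1 SELECT longCol FROM t2;           -- a long outside the int range fails the job

    -- Quick-and-dirty data science: tolerate outliers, let the job finish.
    SET spark.sql.ansi.enabled=false;
    INSERT INTO t1 SELECT longCol FROM t2;           -- the same row is silently written as NULL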

On Wed, Jul 31, 2019 at 10:13 AM, Matt Cheah < mcheah@palantir.com > wrote:

> 
> 
> 
> Sorry I meant the current behavior for V2, which fails the query
> compilation if the cast is not safe.
> 
> 
> 
>  
> 
> 
> 
> Agreed that a separate discussion about overflow might be warranted. I’m
> surprised we don’t throw an error now, but it might make sense to do so.
> 
> 
> 
> 
>  
> 
> 
> 
> -Matt Cheah
> 
> 
> 
>  
> 
> 
> 
> *From:* Reynold Xin < rxin@databricks.com >
> *Date:* Wednesday, July 31, 2019 at 9:58 AM
> *To:* Matt Cheah < mcheah@palantir.com >
> *Cc:* Russell Spitzer < russell.spitzer@gmail.com >, Takeshi Yamamuro < linguin.m.s@gmail.com >,
> Gengliang Wang < gengliang.wang@databricks.com >, Ryan Blue < rblue@netflix.com >,
> Spark dev list < dev@spark.apache.org >, Hyukjin Kwon < gurwls223@gmail.com >,
> Wenchen Fan < cloud0fan@gmail.com >
> *Subject:* Re: [Discuss] Follow ANSI SQL on table insertion
> 
> 
> 
> 
>  
> 
> 
> 
> 
> 
> 
> 
> 
> 
> Matt, what do you mean by maximizing standard 3 while not throwing errors
> when operations overflow? Those two seem contradictory.
> 
> 
> 
> 
>  
> 
> 
> 
> 
>  
> 
> 
> 
> On Wed, Jul 31, 2019 at 9:55 AM, Matt Cheah < mcheah@palantir.com > wrote:
> 
> 
> 
>> 
>> 
>> I’m -1, simply from disagreeing with the premise that we can afford to not
>> be maximal on standard 3. The correctness of the data is non-negotiable,
>> and whatever solution we settle on cannot silently adjust the user’s data
>> under any circumstances.
>> 
>> 
>> 
>>  
>> 
>> 
>> 
>> I think the existing behavior is fine, or perhaps the behavior can be
>> flagged by the destination writer at write time.
>> 
>> 
>> 
>>  
>> 
>> 
>> 
>> -Matt Cheah
>> 
>> 
>> 
>>  
>> 
>> 
>> 
>> *From:* Hyukjin Kwon < gurwls223@gmail.com >
>> *Date:* Monday, July 29, 2019 at 11:33 PM
>> *To:* Wenchen Fan < cloud0fan@gmail.com >
>> *Cc:* Russell Spitzer < russell.spitzer@gmail.com >, Takeshi Yamamuro < linguin.m.s@gmail.com >,
>> Gengliang Wang < gengliang.wang@databricks.com >, Ryan Blue < rblue@netflix.com >,
>> Spark dev list < dev@spark.apache.org >
>> *Subject:* Re: [Discuss] Follow ANSI SQL on table insertion
>> 
>> 
>> 
>> 
>>  
>> 
>> 
>> 
>> 
>> From my look, +1 on the proposal, considering ANSI and other DBMSes in
>> general.
>> 
>> 
>> 
>> 
>>  
>> 
>> 
>> 
>> On Tue, Jul 30, 2019 at 3:21 PM Wenchen Fan < cloud0fan@gmail.com > wrote:
>> 
>> 
>> 
>> 
>>> 
>>> 
>>> We can add a config for a certain behavior if it makes sense, but the most
>>> important thing we want to reach an agreement on here is: what should be the
>>> default behavior?
>>> 
>>> 
>>> 
>>>  
>>> 
>>> 
>>> 
>>> 
>>> Let's explore the solution space of table insertion behavior first:
>>> 
>>> 
>>> 
>>> 
>>> At compile time,
>>> 
>>> 
>>> 
>>> 
>>> 1. always add cast
>>> 
>>> 
>>> 
>>> 
>>> 2. add cast following the ANSI SQL store assignment rule (e.g. string to
>>> int is forbidden but long to int is allowed)
>>> 
>>> 
>>> 
>>> 
>>> 3. only add cast if it's 100% safe
>>> 
>>> 
>>> 
>>> 
>>> At runtime,
>>> 
>>> 
>>> 
>>> 
>>> 1. return null for invalid operations
>>> 
>>> 
>>> 
>>> 
>>> 2. throw exceptions at runtime for invalid operations
>>> 
>>> 
>>> 
>>> 
>>>  
>>> 
>>> 
>>> 
>>> 
>>> The standards to evaluate a solution:
>>> 
>>> 
>>> 
>>> 
>>> 1. How robust the query execution is. For example, users usually don't
>>> want to see the query fail midway.
>>> 
>>> 
>>> 
>>> 
>>> 2. How tolerant it is of user queries. For example, a user may want to write
>>> long values to an int column because he knows all the long values won't
>>> exceed the int range.
>>> 
>>> 
>>> 
>>> 
>>> 3. How clean the result is. For example, users usually don't want to see
>>> silently corrupted data (null values).
>>> 
>>> 
>>> 
>>> 
>>>  
>>> 
>>> 
>>> 
>>> 
>>> The current Spark behavior for Data Source V1 tables: always add cast and
>>> return null for invalid operations. This maximizes standard 1 and 2, but
>>> the result is least clean and users are very likely to see silently
>>> corrupted data (null values).
>>> 
>>> 
>>> 
>>> 
>>>  
>>> 
>>> 
>>> 
>>> 
>>> The current Spark behavior for Data Source V2 tables (new in Spark 3.0):
>>> only add cast if it's 100% safe. This maximizes standard 1 and 3, but many
>>> queries may fail to compile, even if these queries can run on other SQL
>>> systems. Note that people can still see silently corrupted data, because
>>> cast is not the only operation that can return corrupted data. Simple
>>> operations like ADD can also return corrupted data if overflow happens,
>>> e.g. INSERT INTO t1 (intCol) SELECT anotherIntCol + 100 FROM t2
>>> 
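>>> To make that last point concrete, a minimal sketch (assuming the current
>>> non-ANSI arithmetic; the table definitions are made up). Even when no cast
>>> is needed at all, the addition itself can overflow and silently write a
>>> wrapped value:
>>>
>>>     CREATE TABLE t1 (intCol INT) USING parquet;
>>>     CREATE TABLE t2 (anotherIntCol INT) USING parquet;
>>>     INSERT INTO t2 VALUES (2147483647);            -- Int.MaxValue
>>>     INSERT INTO t1 SELECT anotherIntCol + 100 FROM t2;
>>>     SELECT * FROM t1;                              -- a large negative number, not an error
>>>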
>>> 
>>> 
>>> 
>>>  
>>> 
>>> 
>>> 
>>> 
>>> The proposal here: add cast following the ANSI SQL store assignment rule, and
>>> return null for invalid operations. This maximizes standard 1, and also
>>> fits standard 2 well: if a query can't compile in Spark, it usually can't
>>> compile in other mainstream databases either. I think that's tolerant
>>> enough. For standard 3, this proposal doesn't maximize it, but it already
>>> avoids many invalid operations.
>>> 
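>>> A sketch of how this proposed default would treat two inserts (table and
>>> column names here are made up):
>>>
>>>     CREATE TABLE target (i INT) USING parquet;
>>>     -- string -> int violates the ANSI store assignment rule, so this is
>>>     -- rejected at compile/analysis time:
>>>     INSERT INTO target SELECT stringCol FROM source;
>>>     -- long -> int is allowed, so a cast is added; under this proposal a value
>>>     -- outside the int range becomes NULL at runtime (or fails, if the optional
>>>     -- runtime-exception mode is enabled):
>>>     INSERT INTO target SELECT longCol FROM source;
>>>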
>>> 
>>> 
>>> 
>>>  
>>> 
>>> 
>>> 
>>> 
>>> Technically we can't make the result 100% clean at compile time; we have
>>> to handle things like overflow at runtime. I think the new proposal makes
>>> more sense as the default behavior.
>>> 
>>> 
>>> 
>>> 
>>>   
>>> 
>>> 
>>> 
>>> 
>>>  
>>> 
>>> 
>>> 
>>> On Mon, Jul 29, 2019 at 8:31 PM Russell Spitzer < russell.spitzer@gmail.com > wrote:
>>> 
>>> 
>>> 
>>>> 
>>>> 
>>>> I understand Spark is making the decisions; I'm saying the actual final
>>>> effect of the null decision would be different depending on the insertion
>>>> target, if the target has different behaviors for null.
>>>> 
>>>> 
>>>> 
>>>> 
>>>>  
>>>> 
>>>> 
>>>> 
>>>> On Mon, Jul 29, 2019 at 5:26 AM Wenchen Fan < cloud0fan@gmail.com > wrote:
>>>> 
>>>> 
>>>> 
>>>> 
>>>>> 
>>>>> 
>>>>> > I'm a big -1 on null values for invalid casts.
>>>>> 
>>>>> 
>>>>> 
>>>>>  
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> This is why we want to introduce the ANSI mode, so that an invalid cast
>>>>> fails at runtime. But we have to keep the null behavior for a while, to keep
>>>>> backward compatibility. Spark has returned null for invalid casts since the
>>>>> first day of Spark SQL; we can't just change it without a way to restore
>>>>> the old behavior.
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>>  
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> I'm OK with adding a strict mode for the upcast behavior in table
>>>>> insertion, but I don't agree with making it the default. The default
>>>>> behavior should be either the ANSI SQL behavior or the legacy Spark
>>>>> behavior.
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>>  
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> > other modes should be allowed only with a strict warning that the behavior
>>>>> will be determined by the underlying sink.
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>>  
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> Seems there is some misunderstanding. The table insertion behavior is
>>>>> fully controlled by Spark. Spark decides when to add a cast, and Spark
>>>>> decides whether an invalid cast should return null or fail. The sink is
>>>>> only responsible for writing data, not the type coercion/cast stuff.
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>>  
>>>>> 
>>>>> 
>>>>> 
>>>>> On Sun, Jul 28, 2019 at 12:24 AM Russell Spitzer < russell.spitzer@gmail.com > wrote:
>>>>> 
>>>>> 
>>>>> 
>>>>>> 
>>>>>> 
>>>>>> I'm a big -1 on null values for invalid casts. This can lead to a lot of
>>>>>> even more unexpected errors and runtime behavior since null is
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>>  
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 1. Not allowed in all schemas (leading to a runtime error anyway)
>>>>>> 2. The same as a delete in some systems (leading to data loss)
>>>>>> 
>>>>>> And this would be dependent on the sink being used. Spark won't just be
>>>>>> interacting with ANSI-compliant sinks, so I think it makes much more sense
>>>>>> to be strict. I think Upcast mode is a sensible default, and other modes
>>>>>> should be allowed only with a strict warning that the behavior will be
>>>>>> determined by the underlying sink.
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>>  
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On Sat, Jul 27, 2019 at 8:05 AM Takeshi Yamamuro < linguin.m.s@gmail.com > wrote:
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> Hi, all
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>>  
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> +1 for implementing this new store cast mode.
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> From a viewpoint of DBMS users, this cast is pretty common for INSERTs, and
>>>>>>> I think this functionality could promote migrations from existing DBMSs to Spark.
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>>  
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> The most important thing for DBMS users is that they could optionally
>>>>>>> choose this mode when inserting data.
>>>>>>> 
>>>>>>> Therefore, I think it might be okay for the two modes (the current upcast
>>>>>>> mode and the proposed store cast mode) to co-exist for INSERTs. (There is
>>>>>>> room to discuss which mode is enabled by default, though...)
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>>  
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> IMHO we'll provide the three behaviours below for INSERTs:
>>>>>>> 
>>>>>>>  - upcast mode
>>>>>>>  - ANSI store cast mode, with runtime exceptions thrown for invalid values
>>>>>>>  - ANSI store cast mode, with null filled in for invalid values
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>>  
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>>  
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> On Sat, Jul 27, 2019 at 8:03 PM Gengliang Wang < gengliang.wang@databricks.com > wrote:
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Hi Ryan,
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>>  
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Thanks for the suggestions on the proposal and doc.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Currently, there is no data type validation in table insertion of V1. We
>>>>>>>> are on the same page that we should improve it. But using UpCast goes from
>>>>>>>> one extreme to the other. It is possible that many queries are broken after
>>>>>>>> upgrading to Spark 3.0.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> The rules of UpCast are too strict. E.g. it doesn't allow assigning
>>>>>>>> Timestamp type to Date type, as there would be "precision loss". To me, the
>>>>>>>> type coercion is reasonable and the "precision loss" is expected. This is
>>>>>>>> very common in other SQL engines.
>>>>>>>> 
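>>>>>>>> A sketch of that case (table and column names made up): under the current V2
>>>>>>>> upcast rule the first insert below fails analysis, because timestamp -> date
>>>>>>>> is not an upcast (the time-of-day part is dropped), while the store assignment
>>>>>>>> rule proposed here would add the cast and accept it, as other SQL engines
>>>>>>>> commonly do:
>>>>>>>> 
>>>>>>>>     CREATE TABLE events (event_date DATE) USING parquet;
>>>>>>>>     INSERT INTO events SELECT event_ts FROM raw_events;              -- event_ts is TIMESTAMP
>>>>>>>>     -- explicit workaround required by the upcast rule:
>>>>>>>>     INSERT INTO events SELECT CAST(event_ts AS DATE) FROM raw_events;
>>>>>>>> 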
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> As long as Spark is following the ANSI SQL store assignment rules, it is
>>>>>>>> users' responsibility to take good care of the type coercion in data
>>>>>>>> writing. I think it's the right decision.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>>  
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> > But the new behavior is only applied in DataSourceV2, so it won’t affect
>>>>>>>> existing jobs until sources move to v2 and break other behavior anyway.
>>>>>>>> 
>>>>>>>> Eventually, most sources are supposed to be migrated to Data Source V2. I
>>>>>>>> think we can discuss and make a decision now.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>>  
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> > Fixing the silent corruption by adding a runtime exception is not a good
>>>>>>>> option, either.
>>>>>>>> 
>>>>>>>> The new optional mode proposed in https://issues.apache.org/jira/browse/SPARK-28512
>>>>>>>> is disabled by default. This should be fine.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>>  
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>>  
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>>  
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Sat, Jul 27, 2019 at 10:23 AM Wenchen Fan < cloud0fan@gmail.com > wrote:
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> I don't agree with handling literal values specially. Although Postgres
>>>>>>>>> does it, I can't find anything about it in the SQL standard. And it
>>>>>>>>> introduces inconsistent behaviors which may be strange to users:
>>>>>>>>> 
>>>>>>>>> * What about something like "INSERT INTO t SELECT float_col + 1.1"?
>>>>>>>>> * The same insert with a decimal column as input will fail even when a
>>>>>>>>> decimal literal would succeed.
>>>>>>>>> * Similar insert queries with "literal" inputs can be constructed through
>>>>>>>>> layers of indirection via views, inline views, CTEs, unions, etc. Would
>>>>>>>>> those decimals be treated as columns and fail, or would we attempt to make
>>>>>>>>> them succeed as well? Would users find this behavior surprising?
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>  
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Silently corrupt data is bad, but this is the decision we made at the
>>>>>>>>> beginning when designing Spark behaviors. Whenever an error occurs, Spark
>>>>>>>>> attempts to return null instead of throwing a runtime exception. Recently
>>>>>>>>> we provided configs to make Spark fail at runtime for overflow, but that's
>>>>>>>>> another story. Silently corrupt data is bad, runtime exceptions are bad,
>>>>>>>>> and forbidding all the table insertions that may fail (even with very
>>>>>>>>> little possibility) is also bad. We have to make trade-offs. The
>>>>>>>>> trade-offs we made in this proposal are:
>>>>>>>>> 
>>>>>>>>> * forbid table insertions that are very likely to fail, at compile time
>>>>>>>>> (things like writing string values to an int column)
>>>>>>>>> * allow table insertions that are not that likely to fail; if the data is
>>>>>>>>> wrong, don't fail, insert null
>>>>>>>>> * provide a config to fail the insertion at runtime if the data is wrong
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>  
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> > But the new behavior is only applied in DataSourceV2, so it won’t
>>>>>>>>> affect existing jobs until sources move to v2 and break other behavior
>>>>>>>>> anyway.
>>>>>>>>> 
>>>>>>>>> When users write SQL queries, they don't care if a table is backed by Data
>>>>>>>>> Source V1 or V2. We should make sure the table insertion behavior is
>>>>>>>>> consistent and reasonable. Furthermore, users may not even care whether the
>>>>>>>>> SQL queries run in Spark or in another RDBMS; it's better to follow the SQL
>>>>>>>>> standard instead of introducing a Spark-specific behavior.
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>  
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> We are not talking about a small use case like allowing writing a decimal
>>>>>>>>> literal to a float column; we are talking about a big goal to make Spark
>>>>>>>>> compliant with the SQL standard, w.r.t.
>>>>>>>>> https://issues.apache.org/jira/browse/SPARK-26217 . This proposal is a
>>>>>>>>> sub-task of it, to make the table insertion behavior follow the SQL standard.
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>  
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Sat, Jul 27, 2019 at 1:35 AM Ryan Blue < rblue@netflix.com > wrote:
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> I don’t think this is a good idea. Following the ANSI standard is usually
>>>>>>>>>> fine, but here it would *silently corrupt data*.
>>>>>>>>>> 
>>>>>>>>>> From your proposal doc, ANSI allows implicitly casting from long to int
>>>>>>>>>> (any numeric type to any other numeric type) and inserts NULL when a value
>>>>>>>>>> overflows. That would drop data values and is not safe.
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> Fixing the silent corruption by adding a runtime exception is not a good
>>>>>>>>>> option, either. That puts off the problem until much of the job has
>>>>>>>>>> completed, instead of catching the error at analysis time. It is better to
>>>>>>>>>> catch this earlier during analysis than to run most of a job and then
>>>>>>>>>> fail.
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> In addition, part of the justification for using the ANSI standard is to
>>>>>>>>>> avoid breaking existing jobs. But the new behavior is only applied in
>>>>>>>>>> DataSourceV2, so it won’t affect existing jobs until sources move to v2
>>>>>>>>>> and break other behavior anyway.
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> I think that the correct solution is to go with the existing validation
>>>>>>>>>> rules that require explicit casts to truncate values.
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> That still leaves the use case that motivated this proposal, which is that
>>>>>>>>>> floating point literals are parsed as decimals and fail simple insert
>>>>>>>>>> statements. We already came up with two alternatives to fix that problem
>>>>>>>>>> in the DSv2 sync, and I think it is a better idea to go with one of those
>>>>>>>>>> instead of “fixing” Spark in a way that will corrupt data or cause runtime
>>>>>>>>>> failures.
>>>>>>>>>> 
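>>>>>>>>>> For reference, the motivating failure looks roughly like this (table name
>>>>>>>>>> made up): a literal such as 1.1 is parsed as decimal(2,1), and decimal ->
>>>>>>>>>> float is not an upcast, so the plain insert is rejected at analysis time
>>>>>>>>>> under the current V2 rule:
>>>>>>>>>> 
>>>>>>>>>>     CREATE TABLE t (f FLOAT) USING parquet;
>>>>>>>>>>     INSERT INTO t VALUES (1.1);                 -- fails analysis: decimal(2,1) cannot up-cast to float
>>>>>>>>>>     INSERT INTO t VALUES (CAST(1.1 AS FLOAT));  -- works: the explicit cast the existing rules require
>>>>>>>>>> 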
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>>  
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On Thu, Jul 25, 2019 at 9:11 AM Wenchen Fan < cloud0fan@gmail.com > wrote:
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> I have heard many complaints about the old table insertion behavior.
>>>>>>>>>>> Blindly casting everything will leak the user's mistake to a late stage of
>>>>>>>>>>> the data pipeline and make it very hard to debug. When a user writes
>>>>>>>>>>> string values to an int column, it's probably a mistake and the columns
>>>>>>>>>>> are misordered in the INSERT statement. We should fail the query earlier
>>>>>>>>>>> and ask users to fix the mistake.
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>>  
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> In the meantime, I agree that the new table insertion behavior we
>>>>>>>>>>> introduced for Data Source V2 is too strict. It may fail valid queries
>>>>>>>>>>> unexpectedly.
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>>  
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> In general, I support the direction of following the ANSI SQL standard,
>>>>>>>>>>> but I'd like to do it in 2 steps:
>>>>>>>>>>> 
>>>>>>>>>>> 1. Only add the cast when the assignment rule is satisfied. This should be
>>>>>>>>>>> the default behavior, and we should provide a legacy config to restore
>>>>>>>>>>> the old behavior.
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 2. Fail the cast operation at runtime if overflow happens. AFAIK Marco
>>>>>>>>>>> Gaido is working on this already. This will have a config as well, and by
>>>>>>>>>>> default we still return null. (Both switches are sketched below.)
>>>>>>>>>>> 
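>>>>>>>>>>> A sketch of what those two switches could look like (the config names and
>>>>>>>>>>> values are placeholders, since neither setting exists yet):
>>>>>>>>>>> 
>>>>>>>>>>>     -- step 1: legacy escape hatch to restore the old "always cast" behavior
>>>>>>>>>>>     SET spark.sql.storeAssignmentPolicy=Legacy;
>>>>>>>>>>>     -- step 2: opt-in runtime failure instead of returning null on overflow
>>>>>>>>>>>     SET spark.sql.ansi.enabled=true;
>>>>>>>>>>> 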
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>>  
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> After doing this, the default behavior will be slightly different from the
>>>>>>>>>>> SQL standard (cast can return null), and users can turn on the ANSI mode
>>>>>>>>>>> to fully follow the SQL standard. This is much better than before and
>>>>>>>>>>> should prevent a lot of user mistakes. It's also a reasonable choice to me
>>>>>>>>>>> to not throw exceptions at runtime by default, as it's usually bad for
>>>>>>>>>>> long-running jobs.
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>>  
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> Thanks,
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> Wenchen 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>>  
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> On Thu, Jul 25, 2019 at 11:37 PM Gengliang Wang < gengliang.wang@databricks.com > wrote:
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> Hi everyone,
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>>  
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> I would like to discuss the table insertion behavior of Spark. In the
>>>>>>>>>>>> current data source V2, only UpCast is allowed for table insertion. I
>>>>>>>>>>>> think following ANSI SQL is a better idea.
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> For more information, please read the design doc "Discuss: Follow ANSI SQL
>>>>>>>>>>>> on table insertion" [docs.google.com] (
>>>>>>>>>>>> https://urldefense.proofpoint.com/v2/url?u=https-3A__docs.google.com_document_d_1b9nnWWbKVDRp7lpzhQS1buv1-5FlDzWIZY2ApFs5rBcGI_edit-3Fusp-3Dsharing&d=DwMFaQ&c=izlc9mHr637UR4lpLEZLFFS3Vn2UXBrZ4tFb6oOnmz8&r=hzwIMNQ9E99EMYGuqHI0kXhVbvX3nU3OSDadUnJxjAs&m=PcMDvxQsrOa4XX4-xjnVwG6ikOifZ3wN1HMMZNx7hfU&s=bLJxfpdKatSv5gaYIKVWx29RmEBKz-DTtRwuYd7lYks&e=
>>>>>>>>>>>> )
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> Please let me know if you have any thoughts
on this.
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>>  
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> Regards,
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> Gengliang
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>>  
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> --
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> Ryan Blue
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> Software Engineer
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> Netflix
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>>  
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> --
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> ---
>>>>>>> Takeshi Yamamuro
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
> 
>