spark-dev mailing list archives

From Hyukjin Kwon <gurwls...@gmail.com>
Subject Re: [Discuss] Follow ANSI SQL on table insertion
Date Thu, 01 Aug 2019 00:55:01 GMT
I am sorry for asking a question without having read the whole discussion after I
replied.

But why does Spark specifically need to do it differently, while the ANSI
standard, other DBMSes and other systems all do the same?
If there isn't an issue specific to Spark, that basically says they are all
wrong.


On Thu, Aug 1, 2019 at 9:31 AM Russell Spitzer <russell.spitzer@gmail.com>
wrote:

> Another solution along those lines, which I know we implemented for limited
> precision types, is just to emit a loud warning whenever you do such a cast,
> i.e. "Warning: we are casting X to Y; this may result in data loss."
>
> On Wed, Jul 31, 2019 at 7:08 PM Russell Spitzer <russell.spitzer@gmail.com>
> wrote:
>
>> I would argue "null" doesn't have to mean invalid. It could mean a missing
>> or deleted record. There is a big difference between a missing record and
>> an invalid record.
>>
>> I definitely have no problem with two modes, but I think setting a
>> parameter to enable lossy conversions is a fine tradeoff to avoid data loss
>> for others. The impact then for those who don't care about lossy casting is
>> an analysis level message "Types don't match, to enable lossy casting set
>> some parameter" while the impact in the other direction is possibly
>> invisible until it hits something critical downstream.
>>
>> On Wed, Jul 31, 2019 at 6:50 PM Ryan Blue <rblue@netflix.com> wrote:
>>
>>> > you guys seem to be arguing no those users don't know what they are
>>> doing and they should not exist.
>>>
>>> I'm not arguing that they don't exist. Just that the impact of this
>>> behavior falls disproportionately on people who don't know about it. And
>>> there are a lot of those people as well.
>>>
>>> On Wed, Jul 31, 2019 at 4:48 PM Ryan Blue <rblue@netflix.com> wrote:
>>>
>>>> > "between a runtime error and an analysis-time error" → I think one
of
>>>> those should be the default.
>>>>
>>>> If you're saying that the default should be an error of some kind, then
>>>> I think we agree. I'm also fine with having a mode that allows turning off
>>>> the error and silently replacing values with NULL... as long as it isn't
>>>> the default and I can set the default for my platform to an analysis-time
>>>> error.
>>>>
>>>> On Wed, Jul 31, 2019 at 4:42 PM Russell Spitzer <
>>>> russell.spitzer@gmail.com> wrote:
>>>>
>>>>> I definitely view it as silently corrupting. If I'm copying over a
>>>>> dataset where some elements are null and some have values, how do I
>>>>> differentiate between my expected nulls and those that were added
>>>>> silently by the cast?
>>>>>
>>>>> On Wed, Jul 31, 2019 at 6:15 PM Reynold Xin <rxin@databricks.com>
>>>>> wrote:
>>>>>
>>>>>> "between a runtime error and an analysis-time error" → I think
one of
>>>>>> those should be the default.
>>>>>>
>>>>>> Maybe we are talking past each other or I wasn't explaining clearly,
>>>>>> but I don't think you understand what I said and the use cases out
there. I
>>>>>> as an end user could very well be fully aware of the consequences
of
>>>>>> exceptional values but I can choose to ignore them. This is especially
>>>>>> common for data scientists who are doing some quick and dirty analysis
or
>>>>>> exploration. You can't deny this large class of use cases out there
>>>>>> (probably makes up half of Spark use cases actually).
>>>>>>
>>>>>> Also, writing out the exceptional cases as null is not silently
>>>>>> corrupting them. The engine is sending an explicit signal that the
value is
>>>>>> no longer valid given the constraint.
>>>>>>
>>>>>> Not sure if this is the best analogy, but think about checked
>>>>>> exceptions in Java. It's great for writing low level code in which
error
>>>>>> handling is paramount, e.g. storage systems, network layers. But
in most
>>>>>> high level applications people just write boilerplate catches that
are
>>>>>> no-ops, because they have other priorities and they can tolerate
>>>>>> mishandling of exceptions, although often maybe they shouldn't.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Wed, Jul 31, 2019 at 2:55 PM, Ryan Blue <rblue@netflix.com>
wrote:
>>>>>>
>>>>>>> Another important aspect of this problem is whether a user is
>>>>>>> conscious of the cast that is inserted by Spark. Most of the
time, users
>>>>>>> are not aware of casts that are implicitly inserted, and that
means
>>>>>>> replacing values with NULL would be a very surprising behavior.
The impact
>>>>>>> of this choice affects users disproportionately: someone that
knows about
>>>>>>> inserted casts is mildly annoyed when required to add an explicit
cast, but
>>>>>>> a user that doesn't know an inserted cast is dropping values
is very
>>>>>>> negatively impacted and may not discover the problem until it
is too late.
>>>>>>>
>>>>>>> That disproportionate impact is what makes me think that it is
not
>>>>>>> okay for Spark to silently replace values with NULL, even if
that's what
>>>>>>> ANSI would allow. Other databases also have the ability to reject
null
>>>>>>> values in tables, providing extra insurance against the problem,
but Spark
>>>>>>> doesn't have required columns in its DDL.
>>>>>>>
>>>>>>> So while I agree with Reynold that there is a trade-off, I think
>>>>>>> that trade-off makes the choice between a runtime error and an
>>>>>>> analysis-time error. I'm okay with either a runtime error as
the default or
>>>>>>> an analysis error as the default, as long as there is a setting
that allows
>>>>>>> me to choose one for my deployment.
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Jul 31, 2019 at 10:39 AM Reynold Xin <rxin@databricks.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>> OK to push back: "disagreeing with the premise that we can afford
to
>>>>>>> not be maximal on standard 3. The correctness of the data is
>>>>>>> non-negotiable, and whatever solution we settle on cannot silently
adjust
>>>>>>> the user’s data under any circumstances."
>>>>>>>
>>>>>>> This blanket statement sounds great on the surface, but there are
a lot
>>>>>>> of subtleties. "Correctness" is absolutely important, but engineering/prod
>>>>>>> development are often about tradeoffs, and the industry has consistently
>>>>>>> traded correctness for performance or convenience, e.g. overflow
checks,
>>>>>>> null pointers, consistency in databases ...
>>>>>>>
>>>>>>> It all depends on the use cases and what degree of error they can
>>>>>>> tolerate. For example, while I want my data engineering production
pipeline
>>>>>>> to throw any error when the data doesn't match my expectations
(e.g. type
>>>>>>> widening, overflow), if I'm doing some quick and dirty data science,
I
>>>>>>> don't want the job to just fail due to outliers.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Jul 31, 2019 at 10:13 AM, Matt Cheah <mcheah@palantir.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>> Sorry I meant the current behavior for V2, which fails the query
>>>>>>> compilation if the cast is not safe.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Agreed that a separate discussion about overflow might be warranted.
>>>>>>> I’m surprised we don’t throw an error now, but it might be
warranted to do
>>>>>>> so.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> -Matt Cheah
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> *From: *Reynold Xin <rxin@databricks.com>
>>>>>>> *Date: *Wednesday, July 31, 2019 at 9:58 AM
>>>>>>> *To: *Matt Cheah <mcheah@palantir.com>
>>>>>>> *Cc: *Russell Spitzer <russell.spitzer@gmail.com>, Takeshi
Yamamuro
>>>>>>> <linguin.m.s@gmail.com>, Gengliang Wang <
>>>>>>> gengliang.wang@databricks.com>, Ryan Blue <rblue@netflix.com>,
>>>>>>> Spark dev list <dev@spark.apache.org>, Hyukjin Kwon <
>>>>>>> gurwls223@gmail.com>, Wenchen Fan <cloud0fan@gmail.com>
>>>>>>> *Subject: *Re: [Discuss] Follow ANSI SQL on table insertion
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Matt what do you mean by maximizing 3, while allowing not throwing
>>>>>>> errors when any operations overflow? Those two seem contradictory.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Jul 31, 2019 at 9:55 AM, Matt Cheah <mcheah@palantir.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>> I’m -1, simply from disagreeing with the premise that we can
afford
>>>>>>> to not be maximal on standard 3. The correctness of the data
is
>>>>>>> non-negotiable, and whatever solution we settle on cannot silently
adjust
>>>>>>> the user’s data under any circumstances.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> I think the existing behavior is fine, or perhaps the behavior
can
>>>>>>> be flagged by the destination writer at write time.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> -Matt Cheah
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> *From: *Hyukjin Kwon <gurwls223@gmail.com>
>>>>>>> *Date: *Monday, July 29, 2019 at 11:33 PM
>>>>>>> *To: *Wenchen Fan <cloud0fan@gmail.com>
>>>>>>> *Cc: *Russell Spitzer <russell.spitzer@gmail.com>, Takeshi
Yamamuro
>>>>>>> <linguin.m.s@gmail.com>, Gengliang Wang <
>>>>>>> gengliang.wang@databricks.com>, Ryan Blue <rblue@netflix.com>,
>>>>>>> Spark dev list <dev@spark.apache.org>
>>>>>>> *Subject: *Re: [Discuss] Follow ANSI SQL on table insertion
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> From my perspective, +1 on the proposal, considering ANSI and other
>>>>>>> DBMSes in general.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Jul 30, 2019 at 3:21 PM Wenchen Fan <cloud0fan@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>> We can add a config for a certain behavior if it makes sense,
but
>>>>>>> the most important thing we want to reach an agreement on here is:
what should
>>>>>>> be the default behavior?
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Let's explore the solution space of table insertion behavior
first:
>>>>>>>
>>>>>>> At compile time,
>>>>>>>
>>>>>>> 1. always add cast
>>>>>>>
>>>>>>> 2. add cast following the ANSI SQL store assignment rule (e.g.
>>>>>>> string to int is forbidden but long to int is allowed)
>>>>>>>
>>>>>>> 3. only add cast if it's 100% safe
>>>>>>>
>>>>>>> At runtime,
>>>>>>>
>>>>>>> 1. return null for invalid operations
>>>>>>>
>>>>>>> 2. throw exceptions at runtime for invalid operations
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> The standards to evaluate a solution:
>>>>>>>
>>>>>>> 1. How robust the query execution is. For example, users usually
>>>>>>> don't want to see the query fail midway.
>>>>>>>
>>>>>>> 2. How tolerant it is of user queries. For example, a user may want to
>>>>>>> write long values to an int column, knowing that all the long values
>>>>>>> are within the int range.
>>>>>>>
>>>>>>> 3. How clean the result is. For example, users usually don't
want to
>>>>>>> see silently corrupted data (null values).
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> The current Spark behavior for Data Source V1 tables: always
add
>>>>>>> cast and return null for invalid operations. This maximizes standard
1 and
>>>>>>> 2, but the result is least clean and users are very likely to
see silently
>>>>>>> corrupted data (null values).
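For illustration, a minimal spark-shell sketch of that "return null" behavior, using the cast semantics Spark has shipped with since V1 (nothing below is specific to the proposal; `spark` is the session the shell provides):

    // Invalid cast under the legacy behavior: no error is raised,
    // the value is silently replaced with NULL.
    spark.sql("SELECT CAST('abc' AS INT)").show()   // prints a single row containing null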
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> The current Spark behavior for Data Source V2 tables (new in
Spark
>>>>>>> 3.0): only add cast if it's 100% safe. This maximizes standard
1 and 3, but
>>>>>>> many queries may fail to compile, even if these queries can run
on other
>>>>>>> SQL systems. Note that people can still see silently corrupted data,
>>>>>>> because cast is not the only operation that can return corrupted data.
>>>>>>> Simple operations like ADD can also return corrupted data if overflow
>>>>>>> happens,
>>>>>>> e.g. INSERT INTO t1 (intCol) SELECT anotherIntCol + 100 FROM
t2
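A quick spark-shell sketch of that overflow path, under the default (non-ANSI) integer arithmetic, where the addition simply wraps around instead of failing or producing null:

    // 2147483647 is Int.MaxValue; the addition overflows and silently wraps
    // to a negative number -- corrupted data, with no cast involved at all.
    spark.sql("SELECT 2147483647 + 100 AS v").show()   // v = -2147483549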
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> The proposal here: add cast following ANSI SQL store assignment
>>>>>>> rule, and return null for invalid operations. This maximizes
standard 1,
>>>>>>> and also fits standard 2 well: if a query can't compile in Spark,
it
>>>>>>> usually can't compile in other mainstream databases as well.
I think that's
>>>>>>> tolerant enough. For standard 3, this proposal doesn't maximize
it but can
>>>>>>> avoid many invalid operations already.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Technically we can't make the result 100% clean at compile time;
we
>>>>>>> have to handle things like overflow at runtime. I think the new
proposal
>>>>>>> makes more sense as the default behavior.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Jul 29, 2019 at 8:31 PM Russell Spitzer <
>>>>>>> russell.spitzer@gmail.com> wrote:
>>>>>>>
>>>>>>> I understand Spark is making the decisions; I'm saying the actual
>>>>>>> final effect of the null decision would be different depending on the
>>>>>>> insertion target, if the target has different behaviors for null.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Jul 29, 2019 at 5:26 AM Wenchen Fan <cloud0fan@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>> > I'm a big -1 on null values for invalid casts.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> This is why we want to introduce the ANSI mode, so that invalid
cast
>>>>>>> fails at runtime. But we have to keep the null behavior for a
while, to
>>>>>>> keep backward compatibility. Spark has returned null for invalid casts
>>>>>>> since the first day of Spark SQL; we can't just change it without a
>>>>>>> way to restore the old behavior.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> I'm OK with adding a strict mode for the upcast behavior in table
>>>>>>> insertion, but I don't agree with making it the default. The
default
>>>>>>> behavior should be either the ANSI SQL behavior or the legacy
Spark
>>>>>>> behavior.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> > other modes should be allowed only with strict warning the
>>>>>>> behavior will be determined by the underlying sink.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Seems there is some misunderstanding. The table insertion behavior
>>>>>>> is fully controlled by Spark. Spark decides when to add cast
and Spark
>>>>>>> decides whether an invalid cast should return null or fail. The
sink is only
>>>>>>> responsible for writing data, not the type coercion/cast stuff.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Sun, Jul 28, 2019 at 12:24 AM Russell Spitzer <
>>>>>>> russell.spitzer@gmail.com> wrote:
>>>>>>>
>>>>>>> I'm a big -1 on null values for invalid casts. This can lead
to a
>>>>>>> lot of even more unexpected errors and runtime behavior since
null is
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> 1. Not allowed in all schemas (Leading to a runtime error anyway)
>>>>>>> 2. Is the same as delete in some systems (leading to data loss)
>>>>>>>
>>>>>>> And this would be dependent on the sink being used. Spark won't
just
>>>>>>> be interacting with ANSI compliant sinks so I think it makes
much more
>>>>>>> sense to be strict. I think Upcast mode is a sensible default
and other
>>>>>>> modes should be allowed only with a strict warning that the behavior
>>>>>>> will be determined by the underlying sink.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Sat, Jul 27, 2019 at 8:05 AM Takeshi Yamamuro <
>>>>>>> linguin.m.s@gmail.com> wrote:
>>>>>>>
>>>>>>> Hi, all
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> +1 for implementing this new store cast mode.
>>>>>>>
>>>>>>> From a viewpoint of DBMS users, this cast is pretty common for
>>>>>>> INSERTs and I think this functionality could
>>>>>>>
>>>>>>> promote migrations from existing DBMSs to Spark.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> The most important thing for DBMS users is that they could
>>>>>>> optionally choose this mode when inserting data.
>>>>>>>
>>>>>>> Therefore, I think it might be okay that the two modes (the current
>>>>>>> upcast mode and the proposed store cast mode)
>>>>>>>
>>>>>>> co-exist for INSERTs. (There is room to discuss which mode is
>>>>>>> enabled by default, though...)
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> IMHO we'll provide three behaviours below for INSERTs;
>>>>>>>
>>>>>>>  - upcast mode
>>>>>>>
>>>>>>>  - ANSI store cast mode and runtime exceptions thrown for invalid
>>>>>>> values
>>>>>>>
>>>>>>>  - ANSI store cast mode and null filled for invalid values
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Sat, Jul 27, 2019 at 8:03 PM Gengliang Wang <
>>>>>>> gengliang.wang@databricks.com> wrote:
>>>>>>>
>>>>>>> Hi Ryan,
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Thanks for the suggestions on the proposal and doc.
>>>>>>>
>>>>>>> Currently, there is no data type validation in table insertion
of
>>>>>>> V1. We are on the same page that we should improve it. But using
UpCast is
>>>>>>> from one extreme to another. It is possible that many queries
are broken
>>>>>>> after upgrading to Spark 3.0.
>>>>>>>
>>>>>>> The rules of UpCast are too strict. E.g. it doesn't allow assigning
>>>>>>> Timestamp type to Date Type, as there will be "precision loss".
To me, the
>>>>>>> type coercion is reasonable and the "precision loss" is under
expectation.
>>>>>>> This is very common in other SQL engines.
>>>>>>>
>>>>>>> As long as Spark is following the ANSI SQL store assignment rules,
>>>>>>> it is users' responsibility to take good care of the type coercion
in data
>>>>>>> writing. I think it's the right decision.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> > But the new behavior is only applied in DataSourceV2, so
it won’t
>>>>>>> affect existing jobs until sources move to v2 and break other
behavior
>>>>>>> anyway.
>>>>>>>
>>>>>>> Eventually, most sources are supposed to be migrated to Data Source
>>>>>>> V2. I think we can discuss and make a decision now.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> > Fixing the silent corruption by adding a runtime exception
is not
>>>>>>> a good option, either.
>>>>>>>
>>>>>>> The new optional mode proposed in
>>>>>>> https://issues.apache.org/jira/browse/SPARK-28512 is disabled by
>>>>>>> default. This should be fine.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Sat, Jul 27, 2019 at 10:23 AM Wenchen Fan <cloud0fan@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>> I don't agree with handling literal values specially. Although
>>>>>>> Postgres does it, I can't find anything about it in the SQL standard.
And
>>>>>>> it introduces inconsistent behaviors which may be strange to
users:
>>>>>>>
>>>>>>> * What about something like "INSERT INTO t SELECT float_col +
1.1"?
>>>>>>> * The same insert with a decimal column as input will fail even
when
>>>>>>> a decimal literal would succeed
>>>>>>> * Similar insert queries with "literal" inputs can be constructed
>>>>>>> through layers of indirection via views, inline views, CTEs,
unions, etc.
>>>>>>> Would those decimals be treated as columns and fail or would
we attempt to
>>>>>>> make them succeed as well? Would users find this behavior surprising?
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Silently corrupt data is bad, but this is the decision we made
at
>>>>>>> the beginning when designing Spark behaviors. Whenever an error
>>>>>>> occurs, Spark attempts to return null instead of throwing a runtime
>>>>>>> exception. Recently we provided configs to make Spark fail at runtime
>>>>>>> for overflow, but that's
another
>>>>>>> story. Silently corrupt data is bad, runtime exception is bad,
and
>>>>>>> forbidding all the table insertions that may fail (even with very
little
>>>>>>> possibility) is also bad. We have to make trade-offs. The trade-offs
we
>>>>>>> made in this proposal are:
>>>>>>>
>>>>>>> * forbid table insertions that are very likely to fail, at compile
>>>>>>> time (things like writing string values to an int column).
>>>>>>>
>>>>>>> * allow table insertions that are not that likely to fail. If
the
>>>>>>> data is wrong, don't fail, insert null.
>>>>>>>
>>>>>>> * provide a config to fail the insertion at runtime if the data
is
>>>>>>> wrong.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> >  But the new behavior is only applied in DataSourceV2, so
it won’t
>>>>>>> affect existing jobs until sources move to v2 and break other
behavior
>>>>>>> anyway.
>>>>>>>
>>>>>>> When users write SQL queries, they don't care if a table is backed
>>>>>>> by Data Source V1 or V2. We should make sure the table insertion
behavior
>>>>>>> is consistent and reasonable. Furthermore, users may not even care
>>>>>>> whether the SQL queries are run in Spark or another RDBMS; it's better
>>>>>>> to follow the SQL standard instead of introducing a Spark-specific
>>>>>>> behavior.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> We are not talking about a small use case like allowing writing a
>>>>>>> decimal literal to a float column; we are talking about a big goal to
>>>>>>> make Spark compliant with the SQL standard, w.r.t.
>>>>>>> https://issues.apache.org/jira/browse/SPARK-26217.
>>>>>>> This proposal is a sub-task of it, to make the table insertion
behavior
>>>>>>> follow the SQL standard.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Sat, Jul 27, 2019 at 1:35 AM Ryan Blue <rblue@netflix.com>
wrote:
>>>>>>>
>>>>>>> I don’t think this is a good idea. Following the ANSI standard
is
>>>>>>> usually fine, but here it would *silently corrupt data*.
>>>>>>>
>>>>>>> From your proposal doc, ANSI allows implicitly casting from long
to
>>>>>>> int (any numeric type to any other numeric type) and inserts
NULL
>>>>>>> when a value overflows. That would drop data values and is not
safe.
>>>>>>>
>>>>>>> Fixing the silent corruption by adding a runtime exception is
not a
>>>>>>> good option, either. That puts off the problem until much of
the job has
>>>>>>> completed, instead of catching the error at analysis time. It
is better to
>>>>>>> catch this earlier during analysis than to run most of a job
and then fail.
>>>>>>>
>>>>>>> In addition, part of the justification for using the ANSI standard
>>>>>>> is to avoid breaking existing jobs. But the new behavior is only
applied in
>>>>>>> DataSourceV2, so it won’t affect existing jobs until sources
move to v2 and
>>>>>>> break other behavior anyway.
>>>>>>>
>>>>>>> I think that the correct solution is to go with the existing
>>>>>>> validation rules that require explicit casts to truncate values.
>>>>>>>
>>>>>>> That still leaves the use case that motivated this proposal,
which
>>>>>>> is that floating point literals are parsed as decimals and fail
simple
>>>>>>> insert statements. We already came up with two alternatives to
fix that
>>>>>>> problem in the DSv2 sync and I think it is a better idea to go
with one of
>>>>>>> those instead of “fixing” Spark in a way that will corrupt
data or cause
>>>>>>> runtime failures.
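For readers less familiar with that motivating case, a small spark-shell sketch (the table name in the comment is hypothetical): Spark parses an exact numeric literal such as 1.1 as a decimal, which is what trips the upcast-only check when the target column is float or double:

    // The literal 1.1 is typed as decimal(2,1), not float/double.
    spark.sql("SELECT 1.1 AS x").printSchema()
    // root
    //  |-- x: decimal(2,1) (nullable = false)

    // So under the upcast-only DSv2 rule, an insert like
    //   INSERT INTO float_table VALUES (1.1)   -- float_table is hypothetical
    // is rejected at analysis time, because decimal(2,1) cannot be up-cast to float.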
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Jul 25, 2019 at 9:11 AM Wenchen Fan <cloud0fan@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>> I have heard many complaints about the old table insertion
>>>>>>> behavior. Blindly casting everything will leak the user's mistake
to a late
>>>>>>> stage of the data pipeline, and make it very hard to debug. When
a user
>>>>>>> writes string values to an int column, it's probably a mistake, e.g.
>>>>>>> the columns are misordered in the INSERT statement. We should fail
the query
>>>>>>> earlier and ask users to fix the mistake.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> In the meantime, I agree that the new table insertion behavior
we
>>>>>>> introduced for Data Source V2 is too strict. It may fail valid
queries
>>>>>>> unexpectedly.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> In general, I support the direction of following the ANSI SQL
>>>>>>> standard. But I'd like to do it with 2 steps:
>>>>>>>
>>>>>>> 1. only add cast when the assignment rule is satisfied. This
should
>>>>>>> be the default behavior, and we should provide a legacy config to
>>>>>>> restore the old behavior.
>>>>>>>
>>>>>>> 2. fail the cast operation at runtime if overflow happens. AFAIK
>>>>>>> Marco Gaido is working on it already. This will have a config
as well and
>>>>>>> by default we still return null.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> After doing this, the default behavior will be slightly different
>>>>>>> from the SQL standard (cast can return null), and users can turn
on the
>>>>>>> ANSI mode to fully follow the SQL standard. This is much better
than before
>>>>>>> and should prevent a lot of user mistakes. It's also a reasonable
choice to
>>>>>>> me to not throw exceptions at runtime by default, as it's usually
bad for
>>>>>>> long-running jobs.
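For reference, a hedged sketch of how these two steps could surface to users, written against the configuration names that eventually shipped in Spark 3.0 (spark.sql.storeAssignmentPolicy for the compile-time rule, spark.sql.ansi.enabled for runtime failures); this is only an illustration of the shape of the proposal, not the thread's final decision:

    // Step 1: which implicit casts the analyzer adds for INSERT
    // (ANSI = the store assignment rule discussed here, LEGACY = the old
    // always-cast behavior, STRICT = roughly the upcast-only DSv2 check).
    spark.conf.set("spark.sql.storeAssignmentPolicy", "ANSI")

    // Step 2: whether a cast that overflows at runtime raises an error
    // instead of producing a null/wrapped value.
    spark.conf.set("spark.sql.ansi.enabled", "true")

    // With both settings on, an insertion such as the following fails fast
    // rather than silently writing a corrupted value (int_table is hypothetical):
    // spark.sql("INSERT INTO int_table SELECT CAST(3000000000 AS INT)")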
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Wenchen
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Jul 25, 2019 at 11:37 PM Gengliang Wang <
>>>>>>> gengliang.wang@databricks.com> wrote:
>>>>>>>
>>>>>>> Hi everyone,
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> I would like to discuss the table insertion behavior of Spark.
In
>>>>>>> the current data source V2, only UpCast is allowed for table
insertion. I
>>>>>>> think following ANSI SQL is a better idea.
>>>>>>>
>>>>>>> For more information, please read the Discuss: Follow ANSI SQL on
>>>>>>> table insertion doc [docs.google.com].
>>>>>>>
>>>>>>> Please let me know if you have any thoughts on this.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Regards,
>>>>>>>
>>>>>>> Gengliang
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>>
>>>>>>> Ryan Blue
>>>>>>>
>>>>>>> Software Engineer
>>>>>>>
>>>>>>> Netflix
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>>
>>>>>>> ---
>>>>>>> Takeshi Yamamuro
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Ryan Blue
>>>>>>> Software Engineer
>>>>>>> Netflix
>>>>>>>
>>>>>>
>>>>>>
>>>>
>>>> --
>>>> Ryan Blue
>>>> Software Engineer
>>>> Netflix
>>>>
>>>
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>>>
>>
