spark-dev mailing list archives

From Hyukjin Kwon <gurwls...@gmail.com>
Subject Re: [Discuss] Follow ANSI SQL on table insertion
Date Thu, 01 Aug 2019 08:03:55 GMT
> "return null for invalid operations" behavior as default

This one. It's not only traditional RDBMSs: from what I heard, Impala has
a non-strict mode and a strict mode,
and the default is the non-strict mode too. Hive casts to null as well.
Yes, so I was about to argue that we should return "null" rather than
break the behaviour for DSv2 here alone.
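As a minimal illustration of the "cast to null" behaviour under discussion (made-up
values; under Spark's legacy, non-ANSI mode an unrepresentable cast produces NULL
rather than an error):

    -- Legacy behaviour: invalid casts yield NULL instead of failing.
    SELECT CAST('abc' AS INT);          -- NULL: not a valid integer
    SELECT CAST('2019-02-30' AS DATE);  -- NULL: not a real calendar date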




On Thu, Aug 1, 2019 at 4:41 PM Wenchen Fan <cloud0fan@gmail.com> wrote:

> Hi Hyukjin, I think no one here is against the SQL standard behavior,
> which is no corrupted data + runtime exception. IIUC the main argument here
> is: shall we still keep the existing "return null for invalid operations"
> behavior as default?
>
> Traditional RDBMSs are usually used as the final destination of CLEAN data.
> It's understandable that they need high data quality and try their
> best to avoid corrupted data at any cost.
>
> However, Spark is different. AFAIK Spark is usually used as an ETL tool,
> which needs to deal with DIRTY data. It's convenient if Spark returns null
> for invalid data so that I can filter it out later. I agree that "null"
> doesn't always mean invalid data, but it usually does. The "return null"
> behavior has been there for many years and I haven't seen many people complain
> about it in the past.
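A sketch of the ETL pattern described above, with hypothetical table and column names:
dirty input is cast, invalid values become NULL under the legacy behaviour, and the
NULLs are filtered out afterwards.

    -- Hypothetical staging table with a string-typed raw column.
    SELECT CAST(raw_value AS INT) AS value
    FROM staging_events
    WHERE CAST(raw_value AS INT) IS NOT NULL;  -- drop rows that failed to parse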
>
> Recently there have been several new projects on the storage side, which can
> host CLEAN data and can connect to Spark. This does raise a new requirement
> for Spark to improve data quality. The community is already working on it:
> arithmetic operations now fail if overflow happens (you need to set a config).
> We are going to apply the same to cast as well, and to any other expressions
> that return null for invalid input data.
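As a sketch of what "fails if overflow happens" means once the opt-in check is enabled
(the config key below is a placeholder, since the thread does not name it):

    -- SET <overflow-check-config>=true;  -- placeholder key, not an actual Spark config name
    -- With the check off (the long-standing default), integer arithmetic wraps silently;
    -- with it on, the query fails at runtime with an arithmetic error.
    SELECT CAST(2147483647 AS INT) + CAST(1 AS INT);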
>
> Rome was not built in a day. I think we need to keep the legacy "return
> null" behavior as default for a while, until we have enough confidence
> about the new ANSI mode. The default behavior should fit the majority of
> the users. I believe currently ETL is still the main use case of Spark and
> the "return null" behavior is useful to a lot of users.
>
> On Thu, Aug 1, 2019 at 8:55 AM Hyukjin Kwon <gurwls223@gmail.com> wrote:
>
>> I am sorry for asking a question without reading the whole discussion
>> after I replied.
>>
>> But why does Spark specifically need to do it differently when the ANSI
>> standard, other DBMSes and other systems all do it that way?
>> If there isn't an issue specific to Spark, that basically says they are
>> all wrong.
>>
>>
>> On Thu, Aug 1, 2019 at 9:31 AM Russell Spitzer <russell.spitzer@gmail.com> wrote:
>>
>>> Another solution along those lines, which I know we implemented for
>>> limited precision types, is just to emit a loud warning whenever you do such a
>>> cast. I.e.: "Warning: we are casting X to Y; this may result in data loss."
>>>
>>> On Wed, Jul 31, 2019 at 7:08 PM Russell Spitzer <
>>> russell.spitzer@gmail.com> wrote:
>>>
>>>> I would argue "null" doesn't have to mean invalid. It could mean
>>>> missing or deleted record. There is a lot of difference between missing
>>>> record and invalid record.
>>>>
>>>> I definitely have no problem with two modes, but I think setting a
>>>> parameter to enable lossy conversions is a fine tradeoff to avoid data loss
>>>> for others. The impact then for those who don't care about lossy casting is
>>>> an analysis level message "Types don't match, to enable lossy casting set
>>>> some parameter" while the impact in the other direction is possibly
>>>> invisible until it hits something critical downstream.
>>>>
>>>> On Wed, Jul 31, 2019 at 6:50 PM Ryan Blue <rblue@netflix.com> wrote:
>>>>
>>>>> > you guys seem to be arguing no those users don't know what they are
>>>>> doing and they should not exist.
>>>>>
>>>>> I'm not arguing that they don't exist. Just that the disproportionate
>>>>> impact of awareness about this behavior is much worse for people that don't
>>>>> know about it. And there are a lot of those people as well.
>>>>>
>>>>> On Wed, Jul 31, 2019 at 4:48 PM Ryan Blue <rblue@netflix.com> wrote:
>>>>>
>>>>>> > "between a runtime error and an analysis-time error" → I think
one
>>>>>> of those should be the default.
>>>>>>
>>>>>> If you're saying that the default should be an error of some kind,
>>>>>> then I think we agree. I'm also fine with having a mode that allows
turning
>>>>>> off the error and silently replacing values with NULL... as long
as it
>>>>>> isn't the default and I can set the default for my platform to an
>>>>>> analysis-time error.
>>>>>>
>>>>>> On Wed, Jul 31, 2019 at 4:42 PM Russell Spitzer <
>>>>>> russell.spitzer@gmail.com> wrote:
>>>>>>
>>>>>>> I definitely view it as silently corrupting. If I'm copying over a
>>>>>>> dataset where some elements are null and some have values, how do I
>>>>>>> differentiate between my expected nulls and those that were added in
>>>>>>> silently by the cast?
>>>>>>>
>>>>>>> On Wed, Jul 31, 2019 at 6:15 PM Reynold Xin <rxin@databricks.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> "between a runtime error and an analysis-time error" →
I think one
>>>>>>>> of those should be the default.
>>>>>>>>
>>>>>>>> Maybe we are talking past each other or I wasn't explaining
>>>>>>>> clearly, but I don't think you understand what I said and
the use cases out
>>>>>>>> there. I as an end user could very well be fully aware of
the consequences
>>>>>>>> of exceptional values but I can choose to ignore them. This
is especially
>>>>>>>> common for data scientists who are doing some quick and dirty
analysis or
>>>>>>>> exploration. You can't deny this large class of use cases
out there
>>>>>>>> (probably makes up half of Spark use cases actually).
>>>>>>>>
>>>>>>>> Also, writing out the exceptional cases as null is not silently
>>>>>>>> corrupting them. The engine is sending an explicit signal that the value is
>>>>>>>> no longer valid given the constraint.
>>>>>>>>
>>>>>>>> Not sure if this is the best analogy, but think about checked
>>>>>>>> exceptions in Java. It's great for writing low level code in which error
>>>>>>>> handling is paramount, e.g. storage systems, network layers. But in most
>>>>>>>> high level applications people just write boilerplate catches that are
>>>>>>>> no-ops, because they have other priorities and they can tolerate
>>>>>>>> mishandling of exceptions, although often maybe they shouldn't.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, Jul 31, 2019 at 2:55 PM, Ryan Blue <rblue@netflix.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Another important aspect of this problem is whether a user is
>>>>>>>>> conscious of the cast that is inserted by Spark. Most of the time, users
>>>>>>>>> are not aware of casts that are implicitly inserted, and that means
>>>>>>>>> replacing values with NULL would be a very surprising behavior. The impact
>>>>>>>>> of this choice affects users disproportionately: someone that knows about
>>>>>>>>> inserted casts is mildly annoyed when required to add an explicit cast, but
>>>>>>>>> a user that doesn't know an inserted cast is dropping values is very
>>>>>>>>> negatively impacted and may not discover the problem until it is too late.
>>>>>>>>>
>>>>>>>>> That disproportionate impact is what makes me think that it is not
>>>>>>>>> okay for Spark to silently replace values with NULL, even if that's what
>>>>>>>>> ANSI would allow. Other databases also have the ability to reject null
>>>>>>>>> values in tables, providing extra insurance against the problem, but Spark
>>>>>>>>> doesn't have required columns in its DDL.
>>>>>>>>>
>>>>>>>>> So while I agree with Reynold that there is a trade-off, I think
>>>>>>>>> that trade-off makes the choice between a runtime error and an
>>>>>>>>> analysis-time error. I'm okay with either a runtime error as the default or
>>>>>>>>> an analysis error as the default, as long as there is a setting that allows
>>>>>>>>> me to choose one for my deployment.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Wed, Jul 31, 2019 at 10:39 AM Reynold Xin <rxin@databricks.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> OK to push back: "disagreeing with the premise that we can afford
>>>>>>>>> to not be maximal on standard 3. The correctness of the data is
>>>>>>>>> non-negotiable, and whatever solution we settle on cannot silently adjust
>>>>>>>>> the user’s data under any circumstances."
>>>>>>>>>
>>>>>>>>> This blanket statement sounds great on the surface, but there are a
>>>>>>>>> lot of subtleties. "Correctness" is absolutely important, but
>>>>>>>>> engineering/prod development are often about tradeoffs, and the industry
>>>>>>>>> has consistently traded correctness for performance or convenience, e.g.
>>>>>>>>> overflow checks, null pointers, consistency in databases...
>>>>>>>>>
>>>>>>>>> It all depends on the use cases and to what degree those use cases can
>>>>>>>>> tolerate errors. For example, while I want my data engineering production pipeline
>>>>>>>>> to throw any error when the data doesn't match my expectations (e.g. type
>>>>>>>>> widening, overflow), if I'm doing some quick and dirty data science, I
>>>>>>>>> don't want the job to just fail due to outliers.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Wed, Jul 31, 2019 at 10:13 AM, Matt Cheah <mcheah@palantir.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> Sorry, I meant the current behavior for V2, which fails the query
>>>>>>>>> compilation if the cast is not safe.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Agreed that a separate discussion about overflow might be
>>>>>>>>> warranted. I’m surprised we don’t throw an error now, but it might be
>>>>>>>>> warranted to do so.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> -Matt Cheah
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> *From: *Reynold Xin <rxin@databricks.com>
>>>>>>>>> *Date: *Wednesday, July 31, 2019 at 9:58 AM
>>>>>>>>> *To: *Matt Cheah <mcheah@palantir.com>
>>>>>>>>> *Cc: *Russell Spitzer <russell.spitzer@gmail.com>, Takeshi
>>>>>>>>> Yamamuro <linguin.m.s@gmail.com>, Gengliang Wang <gengliang.wang@databricks.com>,
>>>>>>>>> Ryan Blue <rblue@netflix.com>, Spark dev list <dev@spark.apache.org>,
>>>>>>>>> Hyukjin Kwon <gurwls223@gmail.com>, Wenchen Fan <cloud0fan@gmail.com>
>>>>>>>>> *Subject: *Re: [Discuss] Follow ANSI SQL on table insertion
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Matt, what do you mean by maximizing 3 while allowing not throwing
>>>>>>>>> errors when any operations overflow? Those two seem contradictory.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Wed, Jul 31, 2019 at 9:55 AM, Matt Cheah <mcheah@palantir.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> I’m -1, simply from disagreeing with the premise that we can
>>>>>>>>> afford to not be maximal on standard 3. The correctness of the data is
>>>>>>>>> non-negotiable, and whatever solution we settle on cannot silently adjust
>>>>>>>>> the user’s data under any circumstances.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I think the existing behavior is fine, or perhaps the behavior can
>>>>>>>>> be flagged by the destination writer at write time.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> -Matt Cheah
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> *From: *Hyukjin Kwon <gurwls223@gmail.com>
>>>>>>>>> *Date: *Monday, July 29, 2019 at 11:33 PM
>>>>>>>>> *To: *Wenchen Fan <cloud0fan@gmail.com>
>>>>>>>>> *Cc: *Russell Spitzer <russell.spitzer@gmail.com>, Takeshi
>>>>>>>>> Yamamuro <linguin.m.s@gmail.com>, Gengliang Wang <gengliang.wang@databricks.com>,
>>>>>>>>> Ryan Blue <rblue@netflix.com>, Spark dev list <dev@spark.apache.org>
>>>>>>>>> *Subject: *Re: [Discuss] Follow ANSI SQL on table insertion
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> From my look, +1 on the proposal, considering ANSI and other
>>>>>>>>> DBMSes in general.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Jul 30, 2019 at 3:21 PM Wenchen Fan <cloud0fan@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> We can add a config for a certain behavior if it makes sense, but
>>>>>>>>> the most important thing we want to reach an agreement on here is: what
>>>>>>>>> should be the default behavior?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Let's explore the solution space of table insertion behavior first:
>>>>>>>>>
>>>>>>>>> At compile time,
>>>>>>>>>
>>>>>>>>> 1. always add cast
>>>>>>>>>
>>>>>>>>> 2. add cast following the ANSI SQL store assignment rule (e.g.
>>>>>>>>> string to int is forbidden but long to int is allowed)
>>>>>>>>>
>>>>>>>>> 3. only add cast if it's 100% safe
>>>>>>>>>
>>>>>>>>> At runtime,
>>>>>>>>>
>>>>>>>>> 1. return null for invalid operations
>>>>>>>>>
>>>>>>>>> 2. throw exceptions at runtime for invalid operations
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> The standards to evaluate a solution:
>>>>>>>>>
>>>>>>>>> 1. How robust the query execution is. For example, users usually
>>>>>>>>> don't want to see the query fail midway.
>>>>>>>>>
>>>>>>>>> 2. How tolerant it is to user queries. For example, a user would like to
>>>>>>>>> write long values to an int column because he knows all the long values won't
>>>>>>>>> exceed the int range.
>>>>>>>>>
>>>>>>>>> 3. How clean the result is. For example, users usually don't want
>>>>>>>>> to see silently corrupted data (null values).
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> The current Spark behavior for Data Source V1 tables: always add
>>>>>>>>> cast and return null for invalid operations. This maximizes standard 1 and
>>>>>>>>> 2, but the result is least clean and users are very likely to see silently
>>>>>>>>> corrupted data (null values).
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> The current Spark behavior for Data Source V2 tables (new in Spark
>>>>>>>>> 3.0): only add cast if it's 100% safe. This maximizes standard 1 and 3, but
>>>>>>>>> many queries may fail to compile, even if these queries can run on other
>>>>>>>>> SQL systems. Note that people can still see silently corrupted data,
>>>>>>>>> because cast is not the only thing that can return corrupted data. Simple
>>>>>>>>> operations like ADD can also return corrupted data if overflow happens,
>>>>>>>>> e.g. INSERT INTO t1 (intCol) SELECT anotherIntCol + 100 FROM t2
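To make that concrete: in the example above both columns are INT, so no cast is added
at all, yet the result can still be corrupted (hypothetical tables t1 and t2):

    -- If anotherIntCol is close to the INT maximum, the addition overflows and a
    -- wrapped value is silently written, even though every cast is "100% safe".
    INSERT INTO t1 (intCol)
    SELECT anotherIntCol + 100 FROM t2;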
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> The proposal here: add cast following the ANSI SQL store assignment
>>>>>>>>> rule, and return null for invalid operations. This maximizes standard 1,
>>>>>>>>> and also fits standard 2 well: if a query can't compile in Spark, it
>>>>>>>>> usually can't compile in other mainstream databases as well. I think that's
>>>>>>>>> tolerant enough. For standard 3, this proposal doesn't maximize it but can
>>>>>>>>> avoid many invalid operations already.
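A minimal sketch of how the proposed default would treat two representative inserts
(hypothetical table t with a single INT column, source table src):

    -- Rejected at analysis time under the ANSI store assignment rule
    -- (string is not assignable to int); mainstream databases reject it too.
    INSERT INTO t SELECT stringCol FROM src;

    -- Allowed by the rule (long is assignable to int), so the query compiles,
    -- even though an individual value might not fit at runtime.
    INSERT INTO t SELECT longCol FROM src;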
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Technically we can't make the result 100% clean at compile time;
>>>>>>>>> we have to handle things like overflow at runtime. I think the new proposal
>>>>>>>>> makes more sense as the default behavior.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Mon, Jul 29, 2019 at 8:31 PM Russell Spitzer <
>>>>>>>>> russell.spitzer@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> I understand Spark is making the decisions; I'm saying the actual
>>>>>>>>> final effect of the null decision would be different depending on the
>>>>>>>>> insertion target if the target has different behaviors for null.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Mon, Jul 29, 2019 at 5:26 AM Wenchen Fan <cloud0fan@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> > I'm a big -1 on null values for invalid casts.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> This is why we want to introduce the ANSI mode, so that an invalid
>>>>>>>>> cast fails at runtime. But we have to keep the null behavior for a while,
>>>>>>>>> to keep backward compatibility. Spark has returned null for invalid casts since
>>>>>>>>> the first day of Spark SQL; we can't just change it without a way to
>>>>>>>>> restore the old behavior.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I'm OK with adding a strict mode for the upcast behavior in table
>>>>>>>>> insertion, but I don't agree with making it the default. The default
>>>>>>>>> behavior should be either the ANSI SQL behavior or the legacy Spark
>>>>>>>>> behavior.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> > other modes should be allowed only with a strict warning that the
>>>>>>>>> behavior will be determined by the underlying sink.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Seems there is some misunderstanding. The table insertion behavior
>>>>>>>>> is fully controlled by Spark. Spark decides when to add a cast, and Spark
>>>>>>>>> decides whether an invalid cast should return null or fail. The sink is only
>>>>>>>>> responsible for writing data, not the type coercion/cast stuff.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Sun, Jul 28, 2019 at 12:24 AM Russell Spitzer <
>>>>>>>>> russell.spitzer@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> I'm a big -1 on null values for invalid casts. This can lead to a
>>>>>>>>> lot of even more unexpected errors and runtime behavior, since null is:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> 1. Not allowed in all schemas (leading to a runtime error anyway)
>>>>>>>>> 2. The same as delete in some systems (leading to data loss)
>>>>>>>>>
>>>>>>>>> And this would be dependent on the sink being used. Spark won't
>>>>>>>>> just be interacting with ANSI compliant sinks, so I think it makes much more
>>>>>>>>> sense to be strict. I think Upcast mode is a sensible default, and other
>>>>>>>>> modes should be allowed only with a strict warning that the behavior will be
>>>>>>>>> determined by the underlying sink.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Sat, Jul 27, 2019 at 8:05 AM Takeshi Yamamuro <
>>>>>>>>> linguin.m.s@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> Hi, all
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> +1 for implementing this new store cast mode.
>>>>>>>>>
>>>>>>>>> From a viewpoint of DBMS users, this cast is pretty common for
>>>>>>>>> INSERTs and I think this functionality could
>>>>>>>>> promote migrations from existing DBMSs to Spark.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> The most important thing for DBMS users is that they could
>>>>>>>>> optionally choose this mode when inserting data.
>>>>>>>>>
>>>>>>>>> Therefore, I think it might be okay that the two modes (the
>>>>>>>>> current upcast mode and the proposed store cast mode)
>>>>>>>>> co-exist for INSERTs. (There is room to discuss which mode is
>>>>>>>>> enabled by default though...)
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> IMHO we'll provide three behaviours below for INSERTs;
>>>>>>>>>
>>>>>>>>>  - upcast mode
>>>>>>>>>
>>>>>>>>>  - ANSI store cast mode and runtime exceptions thrown for invalid
>>>>>>>>> values
>>>>>>>>>
>>>>>>>>>  - ANSI store cast mode and null filled for invalid values
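Roughly, the three behaviours differ as sketched below for a single problematic insert
(hypothetical table t with an INT column; no config names are settled in this thread):

    INSERT INTO t SELECT longCol FROM src;
    -- upcast mode:                          rejected at analysis time (long -> int is not an upcast)
    -- ANSI store cast + runtime exceptions: compiles; an overflowing value fails the job at runtime
    -- ANSI store cast + null fill:          compiles; an overflowing value is written as NULL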
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Sat, Jul 27, 2019 at 8:03 PM Gengliang Wang <
>>>>>>>>> gengliang.wang@databricks.com> wrote:
>>>>>>>>>
>>>>>>>>> Hi Ryan,
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thanks for the suggestions on the proposal and doc.
>>>>>>>>>
>>>>>>>>> Currently, there is no data type validation in table insertion of
>>>>>>>>> V1. We are on the same page that we should improve it. But using UpCast goes
>>>>>>>>> from one extreme to another. It is possible that many queries are broken
>>>>>>>>> after upgrading to Spark 3.0.
>>>>>>>>>
>>>>>>>>> The rules of UpCast are too strict. E.g. it doesn't allow
>>>>>>>>> assigning Timestamp type to Date type, as there will be "precision loss".
>>>>>>>>> To me, the type coercion is reasonable and the "precision loss" is
>>>>>>>>> expected. This is very common in other SQL engines.
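For example, the Timestamp-to-Date case mentioned above, sketched with a hypothetical
table (rejected by the current DSv2 UpCast-only check, but a common pattern elsewhere):

    CREATE TABLE t_dates (d DATE) USING parquet;
    -- timestamp -> date is treated as precision loss by UpCast, so this fails to
    -- compile against a DSv2 table, even though most SQL engines accept it.
    INSERT INTO t_dates SELECT current_timestamp();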
>>>>>>>>>
>>>>>>>>> As long as Spark is following the ANSI SQL store assignment rules,
>>>>>>>>> it is users' responsibility to take good care of the type coercion in data
>>>>>>>>> writing. I think it's the right decision.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> > But the new behavior is only applied in DataSourceV2, so it
>>>>>>>>> won’t affect existing jobs until sources move to v2 and break other
>>>>>>>>> behavior anyway.
>>>>>>>>>
>>>>>>>>> Eventually, most sources are supposed to be migrated to
>>>>>>>>> Data Source V2. I think we can discuss and make a decision now.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> > Fixing the silent corruption by adding a runtime exception is
>>>>>>>>> not a good option, either.
>>>>>>>>>
>>>>>>>>> The new optional mode proposed in https://issues.apache.org/jira/browse/SPARK-28512 is
>>>>>>>>> disabled by default. This should be fine.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Sat, Jul 27, 2019 at 10:23 AM Wenchen Fan <cloud0fan@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> I don't agree with handling literal values specially. Although
>>>>>>>>> Postgres does it, I can't find anything about it in the SQL standard. And
>>>>>>>>> it introduces inconsistent behaviors which may be strange to users:
>>>>>>>>>
>>>>>>>>> * What about something like "INSERT INTO t SELECT float_col + 1.1"?
>>>>>>>>> * The same insert with a decimal column as input will fail even
>>>>>>>>> when a decimal literal would succeed
>>>>>>>>> * Similar insert queries with "literal" inputs can be constructed
>>>>>>>>> through layers of indirection via views, inline views, CTEs, unions, etc.
>>>>>>>>> Would those decimals be treated as columns and fail, or would we attempt to
>>>>>>>>> make them succeed as well? Would users find this behavior surprising?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Silently corrupt data is bad, but this is the decision we made at
>>>>>>>>> the beginning when designing Spark behaviors. Whenever an error occurs, Spark
>>>>>>>>> attempts to return null instead of a runtime exception. Recently we provided
>>>>>>>>> configs to make Spark fail at runtime for overflow, but that's another
>>>>>>>>> story. Silently corrupt data is bad, runtime exception is bad, and
>>>>>>>>> forbidding all the table insertions that may fail (even with very little
>>>>>>>>> possibility) is also bad. We have to make trade-offs. The trade-offs we
>>>>>>>>> made in this proposal are:
>>>>>>>>>
>>>>>>>>> * forbid table insertions that are very likely to fail, at compile
>>>>>>>>> time (things like writing string values to an int column)
>>>>>>>>>
>>>>>>>>> * allow table insertions that are not that likely to fail. If the
>>>>>>>>> data is wrong, don't fail, insert null.
>>>>>>>>>
>>>>>>>>> * provide a config to fail the insertion at runtime if the data is
>>>>>>>>> wrong.
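A sketch of the runtime side of those trade-offs (hypothetical table t with an INT
column; the config key is a placeholder, as the exact name is not settled here):

    -- Compiles under the proposal; a value that doesn't fit the INT column is
    -- inserted as NULL by default.
    INSERT INTO t SELECT longCol FROM src;

    -- SET <fail-on-invalid-insert-config>=true;  -- placeholder key: opt in to a
    --                                               runtime error instead of NULL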
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> > But the new behavior is only applied in DataSourceV2, so it
>>>>>>>>> won’t affect existing jobs until sources move to v2 and break other
>>>>>>>>> behavior anyway.
>>>>>>>>>
>>>>>>>>> When users write SQL queries, they don't care if a table is backed
>>>>>>>>> by Data Source V1 or V2. We should make sure the table insertion behavior
>>>>>>>>> is consistent and reasonable. Furthermore, users may not even care if the
>>>>>>>>> SQL queries are run in Spark or another RDBMS; it's better to follow the SQL
>>>>>>>>> standard instead of introducing a Spark-specific behavior.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> We are not talking about a small use case like allowing writing
>>>>>>>>> a decimal literal to a float column; we are talking about a big goal to make
>>>>>>>>> Spark compliant with the SQL standard, w.r.t. https://issues.apache.org/jira/browse/SPARK-26217.
>>>>>>>>> This proposal is a sub-task of it, to make the table insertion behavior
>>>>>>>>> follow the SQL standard.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Sat, Jul 27, 2019 at 1:35 AM Ryan Blue <rblue@netflix.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> I don’t think this is a good idea. Following the ANSI standard is
>>>>>>>>> usually fine, but here it would *silently corrupt data*.
>>>>>>>>>
>>>>>>>>> From your proposal doc, ANSI allows implicitly casting from long
>>>>>>>>> to int (any numeric type to any other numeric type) and inserts
>>>>>>>>> NULL when a value overflows. That would drop data values and is not safe.
>>>>>>>>>
>>>>>>>>> Fixing the silent corruption by adding a runtime exception is not
>>>>>>>>> a good option, either. That puts off the problem until much of the job has
>>>>>>>>> completed, instead of catching the error at analysis time. It is better to
>>>>>>>>> catch this earlier during analysis than to run most of a job and then fail.
>>>>>>>>>
>>>>>>>>> In addition, part of the justification for using the ANSI standard
>>>>>>>>> is to avoid breaking existing jobs. But the new behavior is only applied in
>>>>>>>>> DataSourceV2, so it won’t affect existing jobs until sources move to v2 and
>>>>>>>>> break other behavior anyway.
>>>>>>>>>
>>>>>>>>> I think that the correct solution is to go with the existing
>>>>>>>>> validation rules that require explicit casts to truncate values.
>>>>>>>>>
>>>>>>>>> That still leaves the use case that motivated this proposal, which
>>>>>>>>> is that floating point literals are parsed as decimals and fail simple
>>>>>>>>> insert statements. We already came up with two alternatives to fix that
>>>>>>>>> problem in the DSv2 sync and I think it is a better idea to go with one of
>>>>>>>>> those instead of “fixing” Spark in a way that will corrupt data or cause
>>>>>>>>> runtime failures.
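For reference, the motivating case is roughly the following (hypothetical table):
a fractional literal is parsed as a decimal, and decimal -> float is not an UpCast,
so the statement is rejected under the current DSv2 rule even though it looks harmless.

    CREATE TABLE t_float (f FLOAT) USING parquet;
    INSERT INTO t_float VALUES (1.1);  -- 1.1 is parsed as DECIMAL(2,1); rejected by the
                                       -- UpCast-only validation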
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thu, Jul 25, 2019 at 9:11 AM Wenchen Fan <cloud0fan@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> I have heard many complaints about the old table insertion
>>>>>>>>> behavior. Blindly casting everything will leak the user's mistake to a late
>>>>>>>>> stage of the data pipeline, and make it very hard to debug. When a user
>>>>>>>>> writes string values to an int column, it's probably a mistake and the
>>>>>>>>> columns are misordered in the INSERT statement. We should fail the query
>>>>>>>>> earlier and ask users to fix the mistake.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> In the meantime, I agree that the new table insertion behavior we
>>>>>>>>> introduced for Data Source V2 is too strict. It may fail valid queries
>>>>>>>>> unexpectedly.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> In general, I support the direction of following the ANSI SQL
>>>>>>>>> standard. But I'd like to do it in 2 steps:
>>>>>>>>>
>>>>>>>>> 1. only add cast when the assignment rule is satisfied. This
>>>>>>>>> should be the default behavior and we should provide a legacy config to
>>>>>>>>> restore the old behavior.
>>>>>>>>>
>>>>>>>>> 2. fail the cast operation at runtime if overflow happens. AFAIK
>>>>>>>>> Marco Gaido is working on it already. This will have a config as well and
>>>>>>>>> by default we still return null.
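A sketch of the two-step plan in terms of knobs (both config keys below are
placeholders; the thread does not fix their names):

    -- Step 1 (analysis time): casts are added only when the ANSI store assignment
    -- rule is satisfied; a legacy config restores the old always-cast behaviour.
    -- SET <legacy-insertion-config>=true;    -- placeholder key
    -- Step 2 (runtime): overflow in the added cast returns NULL by default,
    -- or throws if the overflow config is enabled.
    -- SET <fail-on-overflow-config>=true;    -- placeholder key
    INSERT INTO t SELECT longCol FROM src;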
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> After doing this, the default behavior will be slightly different
>>>>>>>>> from the SQL standard (cast can return null), and users can turn on the
>>>>>>>>> ANSI mode to fully follow the SQL standard. This is much better than before
>>>>>>>>> and should prevent a lot of user mistakes. It's also a reasonable choice to
>>>>>>>>> me to not throw exceptions at runtime by default, as it's usually bad for
>>>>>>>>> long-running jobs.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> Wenchen
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thu, Jul 25, 2019 at 11:37 PM Gengliang Wang <
>>>>>>>>> gengliang.wang@databricks.com> wrote:
>>>>>>>>>
>>>>>>>>> Hi everyone,
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I would like to discuss the table insertion behavior of Spark. In
>>>>>>>>> the current data source V2, only UpCast is allowed for table insertion. I
>>>>>>>>> think following ANSI SQL is a better idea.
>>>>>>>>>
>>>>>>>>> For more information, please read the doc "Discuss: Follow ANSI SQL on
>>>>>>>>> table insertion":
>>>>>>>>> https://docs.google.com/document/d/1b9nnWWbKVDRp7lpzhQS1buv1_lDzWIZY2ApFs5rBcGI/edit?usp=sharing
>>>>>>>>>
>>>>>>>>> Please let me know if you have any thoughts on this.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>>
>>>>>>>>> Gengliang
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>>
>>>>>>>>> Ryan Blue
>>>>>>>>>
>>>>>>>>> Software Engineer
>>>>>>>>>
>>>>>>>>> Netflix
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>>
>>>>>>>>> ---
>>>>>>>>> Takeshi Yamamuro
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Ryan Blue
>>>>>>>>> Software Engineer
>>>>>>>>> Netflix
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Ryan Blue
>>>>>> Software Engineer
>>>>>> Netflix
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>> Software Engineer
>>>>> Netflix
>>>>>
>>>>
