spark-dev mailing list archives

From Marco Gaido <marcogaid...@gmail.com>
Subject Re: [Discuss] Follow ANSI SQL on table insertion
Date Thu, 01 Aug 2019 08:55:32 GMT
Hi all,

I agree that having both modes and letting the user choose the one they want
is the best option (I don't see big arguments against this, honestly). Once we
have this, I don't see big differences in what the default is. What I think we
still have to work on is to go ahead with the "strict mode" work and provide a
more convenient way for users to switch between the two options. I mean:
currently we have one flag for throwing an exception on overflow for operations
on decimals, one for doing the same for operations on other data types, and
going ahead we will probably have more. I think in the end we will need to
collect them all under an "umbrella" flag which lets the user simply switch
between strict and non-strict mode. I also think that we will need to document
this very well and give it particular attention in our docs, maybe with a
dedicated section, in order to provide enough visibility on it to end users.
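
Just to sketch what I mean (the flag names below are hypothetical placeholders,
not actual configs):

  -- today: one switch per feature
  SET spark.sql.someOperation.failOnOverflow=true;
  SET spark.sql.anotherOperation.failOnInvalidCast=true;
  -- the idea: a single umbrella switch covering all of them
  SET spark.sql.strictMode.enabled=true;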

Thanks,
Marco

On Thu, Aug 1, 2019 at 9:42 AM Wenchen Fan <cloud0fan@gmail.com> wrote:

> Hi Hyukjin, I think no one here is against the SQL standard behavior,
> which is no corrupted data + runtime exception. IIUC the main argument here
> is: shall we still keep the existing "return null for invalid operations"
> behavior as default?
>
> Traditional RDBMSs are usually used as the final destination of CLEAN data.
> It's understandable that they need high data quality and try their best to
> avoid corrupted data at any cost.
>
> However, Spark is different. AFAIK Spark is usually used as an ETL tool,
> which needs to deal with DIRTY data. It's convenient if Spark returns null
> for invalid data so that I can filter it out later. I agree that "null"
> doesn't always mean invalid data, but it usually does. The "return null"
> behavior has been there for many years and I haven't seen many people
> complain about it in the past.
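>
> A tiny illustration of the workflow I mean (table and column names are made
> up): cast the dirty value, then drop the rows where the cast failed:
>
>   SELECT name, CAST(raw_age AS INT) AS age
>   FROM staging_table
>   WHERE CAST(raw_age AS INT) IS NOT NULL;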
>
> Recently there have been several new projects on the storage side, which can
> host CLEAN data and can connect to Spark. This does raise a new requirement
> for Spark to improve data quality. The community is already working on it:
> arithmetic operations now fail if overflow happens (a config needs to be set).
> We are going to apply the same to cast as well, and to any other expressions
> that return null for invalid input data.
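>
> (Roughly, an illustrative query: with the overflow check enabled,
>
>   SELECT 2147483647 + 1;
>
> fails at runtime instead of silently returning a wrapped-around value.)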
>
> Rome was not built in a day. I think we need to keep the legacy "return
> null" behavior as default for a while, until we have enough confidence
> about the new ANSI mode. The default behavior should fit the majority of
> the users. I believe currently ETL is still the main use case of Spark and
> the "return null" behavior is useful to a lot of users.
>
> On Thu, Aug 1, 2019 at 8:55 AM Hyukjin Kwon <gurwls223@gmail.com> wrote:
>
>> I am sorry, I am asking a question without having read the whole discussion
>> after I replied.
>>
>> But why does Spark specifically need to do it differently from what the ANSI
>> standard, other DBMSes and other systems do?
>> If there isn't an issue specific to Spark, that basically says they are all
>> wrong.
>>
>>
>> On Thu, Aug 1, 2019 at 9:31 AM Russell Spitzer <russell.spitzer@gmail.com>
>> wrote:
>>
>>> Another solution along those lines, which I know we implemented for
>>> limited precision types, is just to emit a loud warning whenever you do such
>>> a cast, i.e. "Warning: we are casting X to Y; this may result in data loss."
>>>
>>> On Wed, Jul 31, 2019 at 7:08 PM Russell Spitzer <
>>> russell.spitzer@gmail.com> wrote:
>>>
>>>> I would argue "null" doesn't have to mean invalid. It could mean a
>>>> missing or deleted record. There is a lot of difference between a missing
>>>> record and an invalid record.
>>>>
>>>> I definitely have no problem with two modes, but I think setting a
>>>> parameter to enable lossy conversions is a fine tradeoff to avoid data loss
>>>> for others. The impact for those who don't care about lossy casting is an
>>>> analysis-level message, "Types don't match, to enable lossy casting set
>>>> some parameter", while the impact in the other direction is possibly
>>>> invisible until it hits something critical downstream.
>>>>
>>>> On Wed, Jul 31, 2019 at 6:50 PM Ryan Blue <rblue@netflix.com> wrote:
>>>>
>>>>> > you guys seem to be arguing no those users don't know what they are
>>>>> doing and they should not exist.
>>>>>
>>>>> I'm not arguing that they don't exist. Just that the disproportionate
>>>>> impact of awareness about this behavior is much worse for people that
>>>>> don't know about it. And there are a lot of those people as well.
>>>>>
>>>>> On Wed, Jul 31, 2019 at 4:48 PM Ryan Blue <rblue@netflix.com> wrote:
>>>>>
>>>>>> > "between a runtime error and an analysis-time error" → I think
one
>>>>>> of those should be the default.
>>>>>>
>>>>>> If you're saying that the default should be an error of some kind,
>>>>>> then I think we agree. I'm also fine with having a mode that allows
>>>>>> turning off the error and silently replacing values with NULL... as long
>>>>>> as it isn't the default and I can set the default for my platform to an
>>>>>> analysis-time error.
>>>>>>
>>>>>> On Wed, Jul 31, 2019 at 4:42 PM Russell Spitzer <
>>>>>> russell.spitzer@gmail.com> wrote:
>>>>>>
>>>>>>> I definitely view it as silently corrupting. If I'm copying over a
>>>>>>> dataset where some elements are null and some have values, how do I
>>>>>>> differentiate between my expected nulls and those that were added in
>>>>>>> silently by the cast?
>>>>>>>
>>>>>>> On Wed, Jul 31, 2019 at 6:15 PM Reynold Xin <rxin@databricks.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> "between a runtime error and an analysis-time error" →
I think one
>>>>>>>> of those should be the default.
>>>>>>>>
>>>>>>>> Maybe we are talking past each other or I wasn't explaining
>>>>>>>> clearly, but I don't think you understand what I said and the use cases
>>>>>>>> out there. I as an end user could very well be fully aware of the
>>>>>>>> consequences of exceptional values but I can choose to ignore them. This
>>>>>>>> is especially common for data scientists who are doing some quick and
>>>>>>>> dirty analysis or exploration. You can't deny this large class of use
>>>>>>>> cases out there (probably makes up half of Spark use cases actually).
>>>>>>>>
>>>>>>>> Also, writing out the exceptional cases as null is not silently
>>>>>>>> corrupting them. The engine is sending an explicit signal that the value
>>>>>>>> is no longer valid given the constraint.
>>>>>>>>
>>>>>>>> Not sure if this is the best analogy, but think about checked
>>>>>>>> exceptions in Java. It's great for writing low level code in which error
>>>>>>>> handling is paramount, e.g. storage systems, network layers. But in most
>>>>>>>> high level applications people just write boilerplate catches that are
>>>>>>>> no-ops, because they have other priorities and they can tolerate
>>>>>>>> mishandling of exceptions, although often maybe they shouldn't.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, Jul 31, 2019 at 2:55 PM, Ryan Blue <rblue@netflix.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Another important aspect of this problem is whether a user is
>>>>>>>>> conscious of the cast that is inserted by Spark. Most of the time, users
>>>>>>>>> are not aware of casts that are implicitly inserted, and that means
>>>>>>>>> replacing values with NULL would be a very surprising behavior. The
>>>>>>>>> impact of this choice affects users disproportionately: someone that
>>>>>>>>> knows about inserted casts is mildly annoyed when required to add an
>>>>>>>>> explicit cast, but a user that doesn't know an inserted cast is dropping
>>>>>>>>> values is very negatively impacted and may not discover the problem
>>>>>>>>> until it is too late.
>>>>>>>>>
>>>>>>>>> That disproportionate impact is what makes me think that it is not
>>>>>>>>> okay for Spark to silently replace values with NULL, even if that's what
>>>>>>>>> ANSI would allow. Other databases also have the ability to reject null
>>>>>>>>> values in tables, providing extra insurance against the problem, but
>>>>>>>>> Spark doesn't have required columns in its DDL.
>>>>>>>>>
>>>>>>>>> So while I agree with Reynold that there is a trade-off, I think
>>>>>>>>> that trade-off makes the choice between a runtime error and an
>>>>>>>>> analysis-time error. I'm okay with either a runtime error as the default
>>>>>>>>> or an analysis error as the default, as long as there is a setting that
>>>>>>>>> allows me to choose one for my deployment.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Wed, Jul 31, 2019 at 10:39 AM Reynold Xin <rxin@databricks.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> OK to push back: "disagreeing with the premise that we can afford
>>>>>>>>> to not be maximal on standard 3. The correctness of the data is
>>>>>>>>> non-negotiable, and whatever solution we settle on cannot silently
>>>>>>>>> adjust the user’s data under any circumstances."
>>>>>>>>>
>>>>>>>>> This blanket statement sounds great on the surface, but there are a
>>>>>>>>> lot of subtleties. "Correctness" is absolutely important, but
>>>>>>>>> engineering/prod development is often about tradeoffs, and the industry
>>>>>>>>> has consistently traded correctness for performance or convenience, e.g.
>>>>>>>>> overflow checks, null pointers, consistency in databases ...
>>>>>>>>>
>>>>>>>>> It all depends on the use cases and what degree of error they can
>>>>>>>>> tolerate. For example, while I want my data engineering production
>>>>>>>>> pipeline to throw an error when the data doesn't match my expectations
>>>>>>>>> (e.g. type widening, overflow), if I'm doing some quick and dirty data
>>>>>>>>> science, I don't want the job to just fail due to outliers.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Wed, Jul 31, 2019 at 10:13 AM, Matt Cheah <mcheah@palantir.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> Sorry I meant the current behavior for V2, which fails the query
>>>>>>>>> compilation if the cast is not safe.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Agreed that a separate discussion about overflow might be warranted.
>>>>>>>>> I’m surprised we don’t throw an error now, but it might be warranted to
>>>>>>>>> do so.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> -Matt Cheah
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> *From: *Reynold Xin <rxin@databricks.com>
>>>>>>>>> *Date: *Wednesday, July 31, 2019 at 9:58 AM
>>>>>>>>> *To: *Matt Cheah <mcheah@palantir.com>
>>>>>>>>> *Cc: *Russell Spitzer <russell.spitzer@gmail.com>, Takeshi
>>>>>>>>> Yamamuro <linguin.m.s@gmail.com>, Gengliang Wang
>>>>>>>>> <gengliang.wang@databricks.com>, Ryan Blue <rblue@netflix.com>,
>>>>>>>>> Spark dev list <dev@spark.apache.org>, Hyukjin Kwon
>>>>>>>>> <gurwls223@gmail.com>, Wenchen Fan <cloud0fan@gmail.com>
>>>>>>>>> *Subject: *Re: [Discuss] Follow ANSI SQL on table insertion
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Matt, what do you mean by maximizing 3, while allowing not throwing
>>>>>>>>> errors when any operations overflow? Those two seem contradictory.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Wed, Jul 31, 2019 at 9:55 AM, Matt Cheah <mcheah@palantir.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> I’m -1, simply from disagreeing with the premise that we can
>>>>>>>>> afford to not be maximal on standard 3. The correctness of the data is
>>>>>>>>> non-negotiable, and whatever solution we settle on cannot silently
>>>>>>>>> adjust the user’s data under any circumstances.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I think the existing behavior is fine, or perhaps the behavior can
>>>>>>>>> be flagged by the destination writer at write time.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> -Matt Cheah
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> *From: *Hyukjin Kwon <gurwls223@gmail.com>
>>>>>>>>> *Date: *Monday, July 29, 2019 at 11:33 PM
>>>>>>>>> *To: *Wenchen Fan <cloud0fan@gmail.com>
>>>>>>>>> *Cc: *Russell Spitzer <russell.spitzer@gmail.com>, Takeshi
>>>>>>>>> Yamamuro <linguin.m.s@gmail.com>, Gengliang Wang
>>>>>>>>> <gengliang.wang@databricks.com>, Ryan Blue <rblue@netflix.com>,
>>>>>>>>> Spark dev list <dev@spark.apache.org>
>>>>>>>>> *Subject: *Re: [Discuss] Follow ANSI SQL on table insertion
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> From my look, +1 on the proposal, considering ANSI and other
>>>>>>>>> DBMSes in general.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Jul 30, 2019 at 3:21 PM Wenchen Fan <cloud0fan@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> We can add a config for a certain behavior if it makes sense, but
>>>>>>>>> the most important thing we want to reach an agreement on here is: what
>>>>>>>>> should be the default behavior?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Let's explore the solution space of table insertion behavior first:
>>>>>>>>>
>>>>>>>>> At compile time,
>>>>>>>>>
>>>>>>>>> 1. always add cast
>>>>>>>>>
>>>>>>>>> 2. add cast following the ANSI SQL store assignment rule (e.g.
>>>>>>>>> string to int is forbidden but long to int is allowed)
>>>>>>>>>
>>>>>>>>> 3. only add cast if it's 100% safe
>>>>>>>>>
>>>>>>>>> At runtime,
>>>>>>>>>
>>>>>>>>> 1. return null for invalid operations
>>>>>>>>>
>>>>>>>>> 2. throw exceptions at runtime for invalid operations
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> The standards to evaluate a solution:
>>>>>>>>>
>>>>>>>>> 1. How robust the query execution is. For example, users usually
>>>>>>>>> don't want to see the query fail midway.
>>>>>>>>>
>>>>>>>>> 2. How tolerant it is of user queries. For example, a user would like
>>>>>>>>> to write long values to an int column as he knows all the long values
>>>>>>>>> won't exceed the int range.
>>>>>>>>>
>>>>>>>>> 3. How clean the result is. For example, users usually don't want to
>>>>>>>>> see silently corrupted data (null values).
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> The current Spark behavior for Data Source V1 tables: always add
>>>>>>>>> cast and return null for invalid operations. This maximizes standard 1
>>>>>>>>> and 2, but the result is least clean and users are very likely to see
>>>>>>>>> silently corrupted data (null values).
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> The current Spark behavior for Data Source V2 tables (new in Spark
>>>>>>>>> 3.0): only add cast if it's 100% safe. This maximizes standard 1 and 3,
>>>>>>>>> but many queries may fail to compile, even if these queries can run on
>>>>>>>>> other SQL systems. Note that people can still see silently corrupted
>>>>>>>>> data, because cast is not the only thing that can return corrupted data.
>>>>>>>>> Simple operations like ADD can also return corrupted data if overflow
>>>>>>>>> happens, e.g. INSERT INTO t1 (intCol) SELECT anotherIntCol + 100 FROM t2
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> The proposal here: add cast following the ANSI SQL store assignment
>>>>>>>>> rule, and return null for invalid operations. This maximizes standard 1,
>>>>>>>>> and also fits standard 2 well: if a query can't compile in Spark, it
>>>>>>>>> usually can't compile in other mainstream databases either. I think
>>>>>>>>> that's tolerant enough. For standard 3, this proposal doesn't maximize
>>>>>>>>> it but can avoid many invalid operations already.
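>>>>>>>>>
>>>>>>>>> A concrete sketch of what the proposal means (illustrative table and
>>>>>>>>> column names), assuming a table t with a single int column i:
>>>>>>>>>
>>>>>>>>> INSERT INTO t SELECT longCol FROM src    -- allowed, cast is added; a value that overflows int becomes null
>>>>>>>>> INSERT INTO t SELECT stringCol FROM src  -- rejected at compile time by the store assignment rule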
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Technically we can't make the result 100% clean at compile time;
>>>>>>>>> we have to handle things like overflow at runtime. I think the new
>>>>>>>>> proposal makes more sense as the default behavior.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Mon, Jul 29, 2019 at 8:31 PM Russell Spitzer <
>>>>>>>>> russell.spitzer@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> I understand Spark is making the decisions; I'm saying the actual
>>>>>>>>> final effect of the null decision would be different depending on the
>>>>>>>>> insertion target, if the target has different behaviors for null.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Mon, Jul 29, 2019 at 5:26 AM Wenchen Fan <cloud0fan@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> > I'm a big -1 on null values for invalid casts.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> This is why we want to introduce the ANSI mode, so that invalid
>>>>>>>>> casts fail at runtime. But we have to keep the null behavior for a
>>>>>>>>> while, to keep backward compatibility. Spark has returned null for
>>>>>>>>> invalid casts since the first day of Spark SQL; we can't just change it
>>>>>>>>> without a way to restore the old behavior.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I'm OK with adding a strict mode for the upcast behavior in table
>>>>>>>>> insertion, but I don't agree with making it the default. The default
>>>>>>>>> behavior should be either the ANSI SQL behavior or the legacy Spark
>>>>>>>>> behavior.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> > other modes should be allowed only with strict warning the
>>>>>>>>> behavior will be determined by the underlying sink.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Seems there is some misunderstanding. The table insertion behavior
>>>>>>>>> is fully controlled by Spark. Spark decides when to add a cast and Spark
>>>>>>>>> decides whether an invalid cast should return null or fail. The sink is
>>>>>>>>> only responsible for writing data, not the type coercion/cast stuff.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Sun, Jul 28, 2019 at 12:24 AM Russell Spitzer <
>>>>>>>>> russell.spitzer@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> I'm a big -1 on null values for invalid casts. This can lead to a
>>>>>>>>> lot of even more unexpected errors and runtime behavior, since null is
>>>>>>>>>
>>>>>>>>> 1. Not allowed in all schemas (leading to a runtime error anyway)
>>>>>>>>> 2. The same as a delete in some systems (leading to data loss)
>>>>>>>>>
>>>>>>>>> And this would be dependent on the sink being used. Spark won't
>>>>>>>>> just be interacting with ANSI-compliant sinks, so I think it makes much
>>>>>>>>> more sense to be strict. I think Upcast mode is a sensible default and
>>>>>>>>> other modes should be allowed only with a strict warning that the
>>>>>>>>> behavior will be determined by the underlying sink.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Sat, Jul 27, 2019 at 8:05 AM Takeshi Yamamuro <
>>>>>>>>> linguin.m.s@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> Hi, all
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> +1 for implementing this new store cast mode.
>>>>>>>>>
>>>>>>>>> From a viewpoint of DBMS users, this cast is pretty common for
>>>>>>>>> INSERTs and I think this functionality could promote migrations from
>>>>>>>>> existing DBMSs to Spark.
>>>>>>>>>
>>>>>>>>> The most important thing for DBMS users is that they could optionally
>>>>>>>>> choose this mode when inserting data.
>>>>>>>>>
>>>>>>>>> Therefore, I think it might be okay that the two modes (the current
>>>>>>>>> upcast mode and the proposed store cast mode) co-exist for INSERTs.
>>>>>>>>> (There is room to discuss which mode is enabled by default though...)
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> IMHO we'll provide the three behaviours below for INSERTs:
>>>>>>>>>
>>>>>>>>>  - upcast mode
>>>>>>>>>
>>>>>>>>>  - ANSI store cast mode and runtime exceptions thrown for invalid
>>>>>>>>> values
>>>>>>>>>
>>>>>>>>>  - ANSI store cast mode and null filled for invalid values
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Sat, Jul 27, 2019 at 8:03 PM Gengliang Wang <
>>>>>>>>> gengliang.wang@databricks.com> wrote:
>>>>>>>>>
>>>>>>>>> Hi Ryan,
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thanks for the suggestions on the proposal and doc.
>>>>>>>>>
>>>>>>>>> Currently, there is no data type validation in table insertion of
>>>>>>>>> V1. We are on the same page that we should improve it. But using UpCast
>>>>>>>>> goes from one extreme to another. It is possible that many queries are
>>>>>>>>> broken after upgrading to Spark 3.0.
>>>>>>>>>
>>>>>>>>> The rules of UpCast are too strict. E.g. it doesn't allow assigning
>>>>>>>>> Timestamp type to Date type, as there will be "precision loss". To me,
>>>>>>>>> the type coercion is reasonable and the "precision loss" is expected.
>>>>>>>>> This is very common in other SQL engines.
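>>>>>>>>>
>>>>>>>>> The kind of query I have in mind (hypothetical table names), with a
>>>>>>>>> table events that has a DATE column event_date:
>>>>>>>>>
>>>>>>>>> INSERT INTO events SELECT ts FROM logs  -- ts is a TIMESTAMP; UpCast rejects this at analysis time, while most SQL engines add the implicit cast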
>>>>>>>>>
>>>>>>>>> As long as Spark is following the ANSI SQL store assignment rules,
>>>>>>>>> it is users' responsibility to take good care of the type coercion in
>>>>>>>>> data writing. I think it's the right decision.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> > But the new behavior is only applied in DataSourceV2, so it
>>>>>>>>> won’t affect existing jobs until sources move to v2 and break other
>>>>>>>>> behavior anyway.
>>>>>>>>>
>>>>>>>>> Eventually, most sources are supposed to be migrated to Data Source
>>>>>>>>> V2. I think we can discuss and make a decision now.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> > Fixing the silent corruption by adding a runtime exception is
>>>>>>>>> not a good option, either.
>>>>>>>>>
>>>>>>>>> The new optional mode proposed in
>>>>>>>>> https://issues.apache.org/jira/browse/SPARK-28512 is disabled by
>>>>>>>>> default. This should be fine.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Sat, Jul 27, 2019 at 10:23 AM Wenchen Fan <cloud0fan@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> I don't agree with handling literal values specially. Although
>>>>>>>>> Postgres does it, I can't find anything about it in the SQL standard.
>>>>>>>>> And it introduces inconsistent behaviors which may be strange to users:
>>>>>>>>>
>>>>>>>>> * What about something like "INSERT INTO t SELECT float_col + 1.1"?
>>>>>>>>> * The same insert with a decimal column as input will fail even when a
>>>>>>>>> decimal literal would succeed
>>>>>>>>> * Similar insert queries with "literal" inputs can be constructed
>>>>>>>>> through layers of indirection via views, inline views, CTEs, unions,
>>>>>>>>> etc. Would those decimals be treated as columns and fail, or would we
>>>>>>>>> attempt to make them succeed as well? Would users find this behavior
>>>>>>>>> surprising?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Silently corrupt data is bad, but this is the decision we made at
>>>>>>>>> the beginning when designing Spark's behavior. Whenever an error occurs,
>>>>>>>>> Spark attempts to return null instead of throwing a runtime exception.
>>>>>>>>> Recently we provided configs to make Spark fail at runtime for overflow,
>>>>>>>>> but that's another story. Silently corrupt data is bad, runtime
>>>>>>>>> exceptions are bad, and forbidding all the table insertions that may
>>>>>>>>> fail (even with very little possibility) is also bad. We have to make
>>>>>>>>> trade-offs. The trade-offs we made in this proposal are:
>>>>>>>>>
>>>>>>>>> * forbid table insertions that are very likely to fail, at compile
>>>>>>>>> time (things like writing string values to an int column)
>>>>>>>>>
>>>>>>>>> * allow table insertions that are not that likely to fail. If the data
>>>>>>>>> is wrong, don't fail, insert null.
>>>>>>>>>
>>>>>>>>> * provide a config to fail the insertion at runtime if the data is
>>>>>>>>> wrong.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> > But the new behavior is only applied in DataSourceV2, so it
>>>>>>>>> won’t affect existing jobs until sources move to v2 and break other
>>>>>>>>> behavior anyway.
>>>>>>>>>
>>>>>>>>> When users write SQL queries, they don't care if a table is backed
>>>>>>>>> by Data Source V1 or V2. We should make sure the table insertion
>>>>>>>>> behavior is consistent and reasonable. Furthermore, users may not even
>>>>>>>>> care if the SQL queries are run in Spark or in another RDBMS; it's
>>>>>>>>> better to follow the SQL standard instead of introducing a
>>>>>>>>> Spark-specific behavior.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> We are not talking about a small use case like allowing writing a
>>>>>>>>> decimal literal to a float column, we are talking about a big goal to
>>>>>>>>> make Spark compliant with the SQL standard, w.r.t.
>>>>>>>>> https://issues.apache.org/jira/browse/SPARK-26217. This proposal is a
>>>>>>>>> sub-task of it, to make the table insertion behavior follow the SQL
>>>>>>>>> standard.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Sat, Jul 27, 2019 at 1:35 AM Ryan Blue <rblue@netflix.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> I don’t think this is a good idea. Following the ANSI standard is
>>>>>>>>> usually fine, but here it would *silently corrupt data*.
>>>>>>>>>
>>>>>>>>> From your proposal doc, ANSI allows implicitly casting from long
>>>>>>>>> to int (any numeric type to any other numeric type) and inserts NULL
>>>>>>>>> when a value overflows. That would drop data values and is not safe.
>>>>>>>>>
>>>>>>>>> Fixing the silent corruption by adding a runtime exception is not
>>>>>>>>> a good option, either. That puts off the problem until much of the job
>>>>>>>>> has completed, instead of catching the error at analysis time. It is
>>>>>>>>> better to catch this earlier during analysis than to run most of a job
>>>>>>>>> and then fail.
>>>>>>>>>
>>>>>>>>> In addition, part of the justification for using the ANSI standard
>>>>>>>>> is to avoid breaking existing jobs. But the new behavior is only applied
>>>>>>>>> in DataSourceV2, so it won’t affect existing jobs until sources move to
>>>>>>>>> v2 and break other behavior anyway.
>>>>>>>>>
>>>>>>>>> I think that the correct solution is to go with the existing
>>>>>>>>> validation rules that require explicit casts to truncate values.
>>>>>>>>>
>>>>>>>>> That still leaves the use case that motivated this proposal, which
>>>>>>>>> is that floating point literals are parsed as decimals and fail simple
>>>>>>>>> insert statements. We already came up with two alternatives to fix that
>>>>>>>>> problem in the DSv2 sync and I think it is a better idea to go with one
>>>>>>>>> of those instead of “fixing” Spark in a way that will corrupt data or
>>>>>>>>> cause runtime failures.
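>>>>>>>>>
>>>>>>>>> (To spell out the failing case, as a sketch with a made-up table t
>>>>>>>>> that has a float column f:
>>>>>>>>>
>>>>>>>>> INSERT INTO t VALUES (1.1)  -- 1.1 is parsed as a decimal literal, and decimal to float is not an upcast, so the insert fails analysis
>>>>>>>>>
>>>>>>>>> which is the simple statement that breaks under the current V2 rules.)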
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thu, Jul 25, 2019 at 9:11 AM Wenchen Fan <cloud0fan@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> I have heard many complaints about the old table insertion
>>>>>>>>> behavior. Blindly casting everything leaks a user mistake to a late
>>>>>>>>> stage of the data pipeline and makes it very hard to debug. When a user
>>>>>>>>> writes string values to an int column, it's probably a mistake and the
>>>>>>>>> columns are misordered in the INSERT statement. We should fail the query
>>>>>>>>> earlier and ask users to fix the mistake.
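>>>>>>>>>
>>>>>>>>> A typical example of the mistake I mean (made-up schema), with a table
>>>>>>>>> t(id INT, name STRING):
>>>>>>>>>
>>>>>>>>> INSERT INTO t SELECT name, id FROM src  -- columns accidentally swapped; blindly casting name to int just turns the whole column into nulls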
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> At the same time, I agree that the new table insertion behavior we
>>>>>>>>> introduced for Data Source V2 is too strict. It may fail valid queries
>>>>>>>>> unexpectedly.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> In general, I support the direction of following the ANSI SQL
>>>>>>>>> standard. But I'd like to do it in 2 steps:
>>>>>>>>>
>>>>>>>>> 1. only add cast when the assignment rule is satisfied. This should be
>>>>>>>>> the default behavior, and we should provide a legacy config to restore
>>>>>>>>> the old behavior.
>>>>>>>>>
>>>>>>>>> 2. fail the cast operation at runtime if overflow happens. AFAIK Marco
>>>>>>>>> Gaido is working on it already. This will have a config as well and by
>>>>>>>>> default we still return null.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> After doing this, the default behavior will be slightly different
>>>>>>>>> from the SQL standard (cast can return null), and users can turn on the
>>>>>>>>> ANSI mode to fully follow the SQL standard. This is much better than
>>>>>>>>> before and should prevent a lot of user mistakes. It's also a reasonable
>>>>>>>>> choice to me to not throw exceptions at runtime by default, as it's
>>>>>>>>> usually bad for long-running jobs.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> Wenchen
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thu, Jul 25, 2019 at 11:37 PM Gengliang Wang <
>>>>>>>>> gengliang.wang@databricks.com> wrote:
>>>>>>>>>
>>>>>>>>> Hi everyone,
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I would like to discuss the table insertion behavior of Spark. In
>>>>>>>>> the current Data Source V2, only UpCast is allowed for table insertion.
>>>>>>>>> I think following ANSI SQL is a better idea.
>>>>>>>>>
>>>>>>>>> For more information, please read the doc "Discuss: Follow ANSI SQL
>>>>>>>>> on table insertion" [docs.google.com]
>>>>>>>>>
>>>>>>>>> Please let me know if you have any thoughts on this.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>>
>>>>>>>>> Gengliang
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>>
>>>>>>>>> Ryan Blue
>>>>>>>>>
>>>>>>>>> Software Engineer
>>>>>>>>>
>>>>>>>>> Netflix
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>>
>>>>>>>>> ---
>>>>>>>>> Takeshi Yamamuro
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Ryan Blue
>>>>>>>>> Software Engineer
>>>>>>>>> Netflix
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Ryan Blue
>>>>>> Software Engineer
>>>>>> Netflix
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>> Software Engineer
>>>>> Netflix
>>>>>
>>>>
