spark-dev mailing list archives

From Russell Spitzer <russell.spit...@gmail.com>
Subject Re: [Discuss] Follow ANSI SQL on table insertion
Date Wed, 31 Jul 2019 23:42:17 GMT
I definitely view it as silently corrupting. If I'm copying over a dataset
where some elements are null and some have values, how do I differentiate
between my expected nulls and those that were added silently by the cast?
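
To make the concern concrete, here is a minimal sketch (made-up table names,
assuming the current V1 insert behavior that always adds an implicit cast):
after the copy, the NULL that was already in the data and the NULL produced by
the failed cast are indistinguishable.

    -- runnable in spark-sql; legacy/V1 cast semantics assumed
    CREATE TABLE src (s STRING) USING parquet;
    INSERT INTO src VALUES ('1'), (NULL), ('oops');

    CREATE TABLE dst (i INT) USING parquet;
    -- Spark inserts an implicit CAST(s AS INT); 'oops' cannot be parsed
    INSERT INTO dst SELECT s FROM src;

    SELECT i FROM dst;
    -- 1, NULL, NULL  <- the expected NULL and the failed cast look the same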

On Wed, Jul 31, 2019 at 6:15 PM Reynold Xin <rxin@databricks.com> wrote:

> "between a runtime error and an analysis-time error" → I think one of
> those should be the default.
>
> Maybe we are talking past each other or I wasn't explaining clearly, but I
> don't think you understand what I said and the use cases out there. I as an
> end user could very well be fully aware of the consequences of exceptional
> values but I can choose to ignore them. This is especially common for data
> scientists who are doing some quick and dirty analysis or exploration. You
> can't deny this large class of use cases out there (probably makes up half
> of Spark use cases actually).
>
> Also, writing out the exceptional cases as null is not silently corrupting
> them. The engine is sending an explicit signal that the value is no longer
> valid given the constraint.
>
> Not sure if this is the best analogy, but think about checked exceptions
> in Java. It's great for writing low level code in which error handling is
> paramount, e.g. storage systems, network layers. But in most high level
> applications people just write boilerplate catches that are no-ops, because
> they have other priorities and they can tolerate mishandling of exceptions,
> although often maybe they shouldn't.
>
>
>
> On Wed, Jul 31, 2019 at 2:55 PM, Ryan Blue <rblue@netflix.com> wrote:
>
>> Another important aspect of this problem is whether a user is conscious
>> of the cast that is inserted by Spark. Most of the time, users are not
>> aware of casts that are implicitly inserted, and that means replacing
>> values with NULL would be a very surprising behavior. The impact of this
>> choice affects users disproportionately: someone that knows about inserted
>> casts is mildly annoyed when required to add an explicit cast, but a user
>> that doesn't know an inserted cast is dropping values is very negatively
>> impacted and may not discover the problem until it is too late.
>>
>> That disproportionate impact is what makes me think that it is not okay
>> for Spark to silently replace values with NULL, even if that's what ANSI
>> would allow. Other databases also have the ability to reject null values in
>> tables, providing extra insurance against the problem, but Spark doesn't
>> have required columns in its DDL.
>>
>> So while I agree with Reynold that there is a trade-off, I think that
>> trade-off makes the choice between a runtime error and an analysis-time
>> error. I'm okay with either a runtime error as the default or an analysis
>> error as the default, as long as there is a setting that allows me to
>> choose one for my deployment.
>>
>>
>> On Wed, Jul 31, 2019 at 10:39 AM Reynold Xin <rxin@databricks.com> wrote:
>>
>> OK to push back: "disagreeing with the premise that we can afford to not
>> be maximal on standard 3. The correctness of the data is non-negotiable,
>> and whatever solution we settle on cannot silently adjust the user’s data
>> under any circumstances."
>>
>> This blanket statement sounds great on surface, but there are a lot of
>> subtleties. "Correctness" is absolutely important, but engineering/prod
>> development are often about tradeoffs, and the industry has consistently
>> traded correctness for performance or convenience, e.g. overflow checks,
>> null pointers, consistency in databases ...
>>
>> It all depends on the use cases and to what degree those use cases can
>> tolerate errors. For example, while I want my data engineering production pipeline
>> to throw any error when the data doesn't match my expectations (e.g. type
>> widening, overflow), if I'm doing some quick and dirty data science, I
>> don't want the job to just fail due to outliers.
>>
>>
>>
>> On Wed, Jul 31, 2019 at 10:13 AM, Matt Cheah <mcheah@palantir.com> wrote:
>>
>> Sorry I meant the current behavior for V2, which fails the query
>> compilation if the cast is not safe.
>>
>>
>>
>> Agreed that a separate discussion about overflow might be warranted. I’m
>> surprised we don’t throw an error now, but it might make sense to do so.
>>
>>
>>
>> -Matt Cheah
>>
>>
>>
>> *From: *Reynold Xin <rxin@databricks.com>
>> *Date: *Wednesday, July 31, 2019 at 9:58 AM
>> *To: *Matt Cheah <mcheah@palantir.com>
>> *Cc: *Russell Spitzer <russell.spitzer@gmail.com>, Takeshi Yamamuro <
>> linguin.m.s@gmail.com>, Gengliang Wang <gengliang.wang@databricks.com>,
>> Ryan Blue <rblue@netflix.com>, Spark dev list <dev@spark.apache.org>,
>> Hyukjin Kwon <gurwls223@gmail.com>, Wenchen Fan <cloud0fan@gmail.com>
>> *Subject: *Re: [Discuss] Follow ANSI SQL on table insertion
>>
>>
>>
>> Matt, what do you mean by maximizing 3 while allowing errors not to be
>> thrown when operations overflow? Those two seem contradictory.
>>
>>
>>
>>
>>
>> On Wed, Jul 31, 2019 at 9:55 AM, Matt Cheah <mcheah@palantir.com> wrote:
>>
>> I’m -1, simply from disagreeing with the premise that we can afford to
>> not be maximal on standard 3. The correctness of the data is
>> non-negotiable, and whatever solution we settle on cannot silently adjust
>> the user’s data under any circumstances.
>>
>>
>>
>> I think the existing behavior is fine, or perhaps the behavior can be
>> flagged by the destination writer at write time.
>>
>>
>>
>> -Matt Cheah
>>
>>
>>
>> *From: *Hyukjin Kwon <gurwls223@gmail.com>
>> *Date: *Monday, July 29, 2019 at 11:33 PM
>> *To: *Wenchen Fan <cloud0fan@gmail.com>
>> *Cc: *Russell Spitzer <russell.spitzer@gmail.com>, Takeshi Yamamuro <
>> linguin.m.s@gmail.com>, Gengliang Wang <gengliang.wang@databricks.com>,
>> Ryan Blue <rblue@netflix.com>, Spark dev list <dev@spark.apache.org>
>> *Subject: *Re: [Discuss] Follow ANSI SQL on table insertion
>>
>>
>>
>> From my look, +1 on the proposal, considering ANSI and other DBMSes in
>> general.
>>
>>
>>
>> On Tue, Jul 30, 2019 at 3:21 PM Wenchen Fan <cloud0fan@gmail.com> wrote:
>>
>> We can add a config for a certain behavior if it makes sense, but the
>> most important thing to reach agreement on here is: what should be
>> the default behavior?
>>
>>
>>
>> Let's explore the solution space of table insertion behavior first:
>>
>> At compile time,
>>
>> 1. always add cast
>>
>> 2. add cast following the ANSI SQL store assignment rule (e.g. string to
>> int is forbidden but long to int is allowed)
>>
>> 3. only add cast if it's 100% safe
>>
>> At runtime,
>>
>> 1. return null for invalid operations
>>
>> 2. throw exceptions at runtime for invalid operations
>>
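>> To make the compile-time options concrete, a small sketch (made-up table
>> names; the exact rules are in the proposal doc) showing how each option
>> treats two representative inserts:
>>
>>     CREATE TABLE src (stringCol STRING, longCol BIGINT) USING parquet;
>>     CREATE TABLE t (i INT) USING parquet;
>>
>>     -- string -> int
>>     --   1. always add cast: compiles, bad values handled at runtime
>>     --   2. ANSI store assignment: rejected at compile time
>>     --   3. only 100% safe casts: rejected at compile time
>>     INSERT INTO t SELECT stringCol FROM src;
>>
>>     -- long -> int (narrowing)
>>     --   1. compiles;  2. compiles (numeric narrowing allowed);
>>     --   3. rejected, since narrowing is not a safe upcast
>>     INSERT INTO t SELECT longCol FROM src;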
>>
>>
>> The standards to evaluate a solution:
>>
>> 1. How robust the query execution is. For example, users usually don't
>> want to see the query fail midway.
>>
>> 2. How tolerant it is of user queries. For example, a user would like to
>> write long values to an int column as he knows all the long values won't
>> exceed the int range.
>>
>> 3. How clean the result is. For example, users usually don't want to see
>> silently corrupted data (null values).
>>
>>
>>
>> The current Spark behavior for Data Source V1 tables: always add cast and
>> return null for invalid operations. This maximizes standard 1 and 2, but
>> the result is least clean and users are very likely to see silently
>> corrupted data (null values).
>>
>>
>>
>> The current Spark behavior for Data Source V2 tables (new in Spark 3.0):
>> only add cast if it's 100% safe. This maximizes standard 1 and 3, but many
>> queries may fail to compile, even if these queries can run on other SQL
>> systems. Note that people can still see silently corrupted data, because
>> cast is not the only operation that can return corrupted data. Simple
>> operations like ADD can also return corrupted data if overflow happens,
>> e.g. INSERT INTO t1 (intCol) SELECT anotherIntCol + 100 FROM t2
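>>
>> For example, a sketch of that statement (made-up tables, assuming the
>> current default of unchecked integer arithmetic):
>>
>>     CREATE TABLE t1 (intCol INT) USING parquet;
>>     CREATE TABLE t2 (anotherIntCol INT) USING parquet;
>>     INSERT INTO t2 VALUES (2147483647);              -- Int.MaxValue
>>     INSERT INTO t1 SELECT anotherIntCol + 100 FROM t2;
>>     SELECT intCol FROM t1;
>>     -- -2147483549: the ADD wrapped around silently before the insert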
>>
>>
>>
>> The proposal here: add cast following ANSI SQL store assignment rule, and
>> return null for invalid operations. This maximizes standard 1, and also
>> fits standard 2 well: if a query can't compile in Spark, it usually can't
>> compile in other mainstream databases either. I think that's tolerant
>> enough. For standard 3, this proposal doesn't maximize it but can avoid
>> many invalid operations already.
>>
>>
>>
>> Technically we can't make the result 100% clean at compile time; we have
>> to handle things like overflow at runtime. I think the new proposal makes
>> more sense as the default behavior.
>>
>>
>>
>>
>>
>> On Mon, Jul 29, 2019 at 8:31 PM Russell Spitzer <
>> russell.spitzer@gmail.com> wrote:
>>
>> I understand Spark is making the decisions; I'm saying the actual final
>> effect of the null decision would be different depending on the insertion
>> target, if the target has different behaviors for null.
>>
>>
>>
>> On Mon, Jul 29, 2019 at 5:26 AM Wenchen Fan <cloud0fan@gmail.com> wrote:
>>
>> > I'm a big -1 on null values for invalid casts.
>>
>>
>>
>> This is why we want to introduce the ANSI mode, so that an invalid cast
>> fails at runtime. But we have to keep the null behavior for a while, to
>> keep backward compatibility. Spark has returned null for invalid casts
>> since the first day of Spark SQL; we can't just change it without a way to
>> restore the old behavior.
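>>
>> For example (today's behavior, no config needed):
>>
>>     spark-sql> SELECT CAST('abc' AS INT);
>>     NULL
>>
>> Under the proposed ANSI mode the same cast would fail at runtime instead.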
>>
>>
>>
>> I'm OK with adding a strict mode for the upcast behavior in table
>> insertion, but I don't agree with making it the default. The default
>> behavior should be either the ANSI SQL behavior or the legacy Spark
>> behavior.
>>
>>
>>
>> > other modes should be allowed only with a strict warning that the
>> behavior will be determined by the underlying sink.
>>
>>
>>
>> Seems there is some misunderstanding. The table insertion behavior is
>> fully controlled by Spark. Spark decides when to add a cast, and Spark
>> decides whether an invalid cast should return null or fail. The sink is
>> only responsible for writing data, not the type coercion/cast stuff.
>>
>>
>>
>> On Sun, Jul 28, 2019 at 12:24 AM Russell Spitzer <
>> russell.spitzer@gmail.com> wrote:
>>
>> I'm a big -1 on null values for invalid casts. This can lead to a lot of
>> even more unexpected errors and runtime behavior since null is
>>
>>
>>
>> 1. Not allowed in all schemas (Leading to a runtime error anyway)
>> 2. Is the same as delete in some systems (leading to data loss)
>>
>> And this would be dependent on the sink being used. Spark won't just be
>> interacting with ANSI-compliant sinks, so I think it makes much more sense
>> to be strict. I think Upcast mode is a sensible default, and other modes
>> should be allowed only with a strict warning that the behavior will be
>> determined by the underlying sink.
>>
>>
>>
>> On Sat, Jul 27, 2019 at 8:05 AM Takeshi Yamamuro <linguin.m.s@gmail.com>
>> wrote:
>>
>> Hi, all
>>
>>
>>
>> +1 for implementing this new store cast mode.
>>
>> From a viewpoint of DBMS users, this cast is pretty common for INSERTs
>> and I think this functionality could
>>
>> promote migrations from existing DBMSs to Spark.
>>
>>
>>
>> The most important thing for DBMS users is that they could optionally
>> choose this mode when inserting data.
>>
>> Therefore, I think it might be okay that the two modes (the current
>> upcast mode and the proposed store cast mode)
>>
>> co-exist for INSERTs. (There is room to discuss which mode is enabled
>> by default, though...)
>>
>>
>>
>> IMHO we'll provide the three behaviours below for INSERTs:
>>
>>  - upcast mode
>>
>>  - ANSI store cast mode and runtime exceptions thrown for invalid values
>>
>>  - ANSI store cast mode and null filled for invalid values
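>>
>> For illustration only (the config names below are placeholders, not part
>> of any proposal in this thread), the three behaviours could be exposed as
>> two switches:
>>
>>     -- placeholder names, purely illustrative
>>     SET spark.sql.insertion.castMode=upcast;           -- behaviour 1
>>     SET spark.sql.insertion.castMode=ansi;             -- behaviours 2 and 3
>>     SET spark.sql.insertion.failOnInvalidValue=true;   -- 2: throw at runtime
>>     SET spark.sql.insertion.failOnInvalidValue=false;  -- 3: fill null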
>>
>>
>>
>>
>>
>> On Sat, Jul 27, 2019 at 8:03 PM Gengliang Wang <
>> gengliang.wang@databricks.com> wrote:
>>
>> Hi Ryan,
>>
>>
>>
>> Thanks for the suggestions on the proposal and doc.
>>
>> Currently, there is no data type validation in V1 table insertion. We
>> are on the same page that we should improve it. But using UpCast goes from
>> one extreme to the other. It is possible that many queries would break
>> after upgrading to Spark 3.0.
>>
>> The rules of UpCast are too strict. E.g. it doesn't allow assigning
>> Timestamp type to Date type, as there will be "precision loss". To me, the
>> type coercion is reasonable and the "precision loss" is expected. This is
>> very common in other SQL engines.
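>>
>> A quick sketch of that case (made-up table name; behavior as described
>> above, not verified against a build):
>>
>>     CREATE TABLE events (event_date DATE) USING parquet;
>>     -- current_timestamp() is a TIMESTAMP: the upcast-only V2 rule rejects
>>     -- this insert at analysis time, while ANSI store assignment would add
>>     -- a cast and truncate to the date part
>>     INSERT INTO events SELECT current_timestamp();
>>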
>>
>> As long as Spark is following the ANSI SQL store assignment rules, it is
>> users' responsibility to take good care of the type coercion in data
>> writing. I think it's the right decision.
>>
>>
>>
>> > But the new behavior is only applied in DataSourceV2, so it won’t
>> affect existing jobs until sources move to v2 and break other behavior
>> anyway.
>>
>> Eventually, most sources are supposed to be migrated to Data Source V2.
>> I think we can discuss and make a decision now.
>>
>>
>>
>> > Fixing the silent corruption by adding a runtime exception is not a
>> good option, either.
>>
>> The new optional mode proposed in https://issues.apache.org/jira/browse/SPARK-28512
>> is disabled by default. This should be fine.
>>
>>
>>
>>
>>
>>
>>
>> On Sat, Jul 27, 2019 at 10:23 AM Wenchen Fan <cloud0fan@gmail.com> wrote:
>>
>> I don't agree with handling literal values specially. Although Postgres
>> does it, I can't find anything about it in the SQL standard. And it
>> introduces inconsistent behaviors which may be strange to users:
>>
>> * What about something like "INSERT INTO t SELECT float_col + 1.1"?
>> * The same insert with a decimal column as input will fail even when a
>> decimal literal would succeed
>> * Similar insert queries with "literal" inputs can be constructed through
>> layers of indirection via views, inline views, CTEs, unions, etc. Would
>> those decimals be treated as columns and fail or would we attempt to make
>> them succeed as well? Would users find this behavior surprising?
>>
>>
>>
>> Silently corrupt data is bad, but this is the decision we made at the
>> beginning when designing Spark's behavior. Whenever an error occurs, Spark
>> attempts to return null instead of throwing a runtime exception. Recently
>> we provided configs to make Spark fail at runtime for overflow, but that's
>> another story. Silently corrupt data is bad, runtime exceptions are bad,
>> and forbidding all the table insertions that may fail (even with very low
>> probability) is also bad. We have to make trade-offs. The trade-offs we
>> made in this proposal are:
>>
>> * forbid table insertions that are very likely to fail, at compile time
>> (things like writing string values to an int column)
>>
>> * allow table insertions that are not that likely to fail. If the data is
>> wrong, don't fail, insert null.
>>
>> * provide a config to fail the insertion at runtime if the data is wrong.
>>
>>
>>
>> >  But the new behavior is only applied in DataSourceV2, so it won’t
>> affect existing jobs until sources move to v2 and break other behavior
>> anyway.
>>
>> When users write SQL queries, they don't care if a table is backed by
>> Data Source V1 or V2. We should make sure the table insertion behavior is
>> consistent and reasonable. Furthermore, users may not even care whether
>> the SQL queries run in Spark or in another RDBMS; it's better to follow
>> the SQL standard than to introduce Spark-specific behavior.
>>
>>
>>
>> We are not talking about a small use case like allowing writing a decimal
>> literal to a float column; we are talking about a big goal of making Spark
>> compliant with the SQL standard, w.r.t. https://issues.apache.org/jira/browse/SPARK-26217.
>> This proposal is a sub-task of it, to make the table insertion behavior
>> follow the SQL standard.
>>
>>
>>
>> On Sat, Jul 27, 2019 at 1:35 AM Ryan Blue <rblue@netflix.com> wrote:
>>
>> I don’t think this is a good idea. Following the ANSI standard is usually
>> fine, but here it would *silently corrupt data*.
>>
>> From your proposal doc, ANSI allows implicitly casting from long to int
>> (any numeric type to any other numeric type) and inserts NULL when a value
>> overflows. That would drop data values and is not safe.
>>
>> Fixing the silent corruption by adding a runtime exception is not a good
>> option, either. That puts off the problem until much of the job has
>> completed, instead of catching the error at analysis time. It is better to
>> catch this earlier during analysis than to run most of a job and then fail.
>>
>> In addition, part of the justification for using the ANSI standard is to
>> avoid breaking existing jobs. But the new behavior is only applied in
>> DataSourceV2, so it won’t affect existing jobs until sources move to v2 and
>> break other behavior anyway.
>>
>> I think that the correct solution is to go with the existing validation
>> rules that require explicit casts to truncate values.
>>
>> That still leaves the use case that motivated this proposal, which is
>> that floating point literals are parsed as decimals and fail simple insert
>> statements. We already came up with two alternatives to fix that problem in
>> the DSv2 sync and I think it is a better idea to go with one of those
>> instead of “fixing” Spark in a way that will corrupt data or cause runtime
>> failures.
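>>
>> For reference, the motivating case looks roughly like this (a sketch; the
>> two alternatives from the DSv2 sync are not shown here):
>>
>>     CREATE TABLE t (f FLOAT) USING parquet;
>>     -- 1.1 is parsed as DECIMAL(2,1); decimal is not up-castable to float,
>>     -- so the current V2 validation rejects this otherwise reasonable insert
>>     INSERT INTO t VALUES (1.1);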
>>
>>
>>
>> On Thu, Jul 25, 2019 at 9:11 AM Wenchen Fan <cloud0fan@gmail.com> wrote:
>>
>> I have heard many complaints about the old table insertion behavior.
>> Blindly casting everything leaks user mistakes into a late stage of the
>> data pipeline and makes them very hard to debug. When a user writes string
>> values to an int column, it's probably a mistake, e.g. the columns are
>> misordered in the INSERT statement. We should fail the query earlier and
>> ask users to fix the mistake.
>>
>>
>>
>> In the meanwhile, I agree that the new table insertion behavior we
>> introduced for Data Source V2 is too strict. It may fail valid queries
>> unexpectedly.
>>
>>
>>
>> In general, I support the direction of following the ANSI SQL standard.
>> But I'd like to do it with 2 steps:
>>
>> 1. only add cast when the assignment rule is satisfied. This should be
>> the default behavior, and we should provide a legacy config to restore
>> the old behavior.
>>
>> 2. fail the cast operation at runtime if overflow happens. AFAIK Marco
>> Gaido is working on it already. This will have a config as well and by
>> default we still return null.
>>
>>
>>
>> After doing this, the default behavior will be slightly different from
>> the SQL standard (cast can return null), and users can turn on the ANSI
>> mode to fully follow the SQL standard. This is much better than before and
>> should prevent a lot of user mistakes. It's also a reasonable choice to me
>> to not throw exceptions at runtime by default, as it's usually bad for
>> long-running jobs.
>>
>>
>>
>> Thanks,
>>
>> Wenchen
>>
>>
>>
>> On Thu, Jul 25, 2019 at 11:37 PM Gengliang Wang <
>> gengliang.wang@databricks.com> wrote:
>>
>> Hi everyone,
>>
>>
>>
>> I would like to discuss the table insertion behavior of Spark. In the
>> current data source V2, only UpCast is allowed for table insertion. I think
>> following ANSI SQL is a better idea.
>>
>> For more information, please read the doc "Discuss: Follow ANSI SQL on
>> table insertion": https://docs.google.com/document/d/1b9nnWWbKVDRp7lpzhQS1buv1_lDzWIZY2ApFs5rBcGI/edit?usp=sharing
>>
>> Please let me know if you have any thoughts on this.
>>
>>
>>
>> Regards,
>>
>> Gengliang
>>
>>
>>
>>
>> --
>>
>> Ryan Blue
>>
>> Software Engineer
>>
>> Netflix
>>
>>
>>
>>
>> --
>>
>> ---
>> Takeshi Yamamuro
>>
>>
>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>
>
