spark-dev mailing list archives

From Dongjoon Hyun <>
Subject Re: Correctness and data loss issues
Date Wed, 22 Jan 2020 17:43:17 GMT
Hi, Tom.

Then, along with the following, do you think we need to hold the 2.4.5
release, too?

> If it's really a correctness issue we should hold 3.0 for it.


    (1) 2.4.4 delivered 9 correctness patches.
    (2) 2.4.5 RC1 aimed to deliver the following 9 correctness patches, too.

        SPARK-29101 CSV datasource returns incorrect .count() from file
with malformed records
        SPARK-30447 Constant propagation nullability issue
        SPARK-29708 Different answers in aggregates of duplicate grouping
        SPARK-29651 Incorrect parsing of interval seconds fraction
        SPARK-29918 RecordBinaryComparator should check endianness when
compared by long
        SPARK-29042 Sampling-based RDD with unordered input should be
        SPARK-30082 Zeros are being treated as NaNs
        SPARK-29743 sample should set needCopyResult to true if its child is
        SPARK-26985 Test "access only some column of the all of columns "
fails on big endian

Without the official Apache Spark 2.4.5 binaries,
there is no official way to deliver the 9 correctness fixes in (2) to the
In addition, the correctness fixes are usually independent of each other.


On Wed, Jan 22, 2020 at 7:02 AM Tom Graves <> wrote:

> I agree, I think we just need to go through all of them and individually
> assess each one. If it's really a correctness issue we should hold 3.0 for
> it.
> On the 2.4 release, I didn't see an explanation of why it can't be
> backported; I think at the very least we need that in each JIRA comment.
> SPARK-29701 looks more like compatibility with Postgres than a purely
> wrong answer to me. If Spark has been consistent about that, it feels like
> it can wait for 3.0, but it would be good to get others' input; I'm not an
> expert on the SQL standard or on what the other SQL engines do in this case.
> Tom
> On Monday, January 20, 2020, 12:07:54 AM CST, Dongjoon Hyun <> wrote:
> Hi, All.
> According to our policy, "Correctness and data loss issues should be
> considered Blockers".
> Since we are close to branch-3.0 cut,
> I want to ask your opinions on the following correctness and data loss
> issues.
>     SPARK-30218 Columns used in inequality conditions for joins not
> resolved correctly in case of common lineage
>     SPARK-29701 Different answers when empty input given in GROUPING SETS
>     SPARK-29699 Different answers in nested aggregates with window
> functions
>     SPARK-29419 Seq.toDS / spark.createDataset(Seq) is not thread-safe
>     SPARK-28125 dataframes created by randomSplit have overlapping rows
>     SPARK-28067 Incorrect results in decimal aggregation with whole-stage
> code gen enabled
>     SPARK-28024 Incorrect numeric values when out of range
>     SPARK-27784 Alias ID reuse can break correctness when substituting
> foldable expressions
>     SPARK-27619 MapType should be prohibited in hash expressions
>     SPARK-27298 Dataset except operation gives different results(dataset
> count) on Spark 2.3.0 Windows and Spark 2.3.0 Linux environment
>     SPARK-27282 Spark incorrect results when using UNION with GROUP BY
> clause
>     SPARK-27213 Unexpected results when filter is used after distinct
>     SPARK-26836 Columns get switched in Spark SQL using Avro backed Hive
> table if schema evolves
>     SPARK-25150 Joining DataFrames derived from the same source yields
> confusing/incorrect results
>     SPARK-21774 The rule PromoteStrings cast string to a wrong data type
>     SPARK-19248 Regex_replace works in 1.6 but not in 2.0
> Some of them are targeted for 3.0.0, but the others are not.
> Although we will work on them until 3.0.0,
> I'm not sure we can reach a state with no known correctness or data loss
> issues.
> What do you think about the above issues?
> Bests,
> Dongjoon.
