spark-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Mark Sirek (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-28067) Incorrect results in decimal aggregation with whole-stage code gen enabled
Date Sat, 06 Jul 2019 18:38:00 GMT

    [ https://issues.apache.org/jira/browse/SPARK-28067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16879735#comment-16879735
] 

Mark Sirek commented on SPARK-28067:
------------------------------------

I tried the test on 4 different systems, all immediately after downloading Spark, and changing
no settings, so they should all be the defaults.  None of the tests return null.  I'm not
sure which config settings I should change.  I wouldn't think it's expected behavior for
Spark to return an incorrect answer with a certain config setting, unless there's a setting
which controls the hiding or overflows.

> Incorrect results in decimal aggregation with whole-stage code gen enabled
> --------------------------------------------------------------------------
>
>                 Key: SPARK-28067
>                 URL: https://issues.apache.org/jira/browse/SPARK-28067
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.3.0, 2.4.0
>         Environment: Ubuntu LTS 16.04
> Oracle Java 1.8.0_201
> spark-2.4.3-bin-without-hadoop
> spark-shell
>            Reporter: Mark Sirek
>            Priority: Minor
>              Labels: correctness
>
> The following test case involving a join followed by a sum aggregation returns the wrong
answer for the sum:
>  
> {code:java}
> val df = Seq(
>  (BigDecimal("10000000000000000000"), 1),
>  (BigDecimal("10000000000000000000"), 1),
>  (BigDecimal("10000000000000000000"), 2),
>  (BigDecimal("10000000000000000000"), 2),
>  (BigDecimal("10000000000000000000"), 2),
>  (BigDecimal("10000000000000000000"), 2),
>  (BigDecimal("10000000000000000000"), 2),
>  (BigDecimal("10000000000000000000"), 2),
>  (BigDecimal("10000000000000000000"), 2),
>  (BigDecimal("10000000000000000000"), 2),
>  (BigDecimal("10000000000000000000"), 2),
>  (BigDecimal("10000000000000000000"), 2)).toDF("decNum", "intNum")
> val df2 = df.withColumnRenamed("decNum", "decNum2").join(df, "intNum").agg(sum("decNum"))
> scala> df2.show(40,false)
>  ---------------------------------------
> sum(decNum)
> ---------------------------------------
> 40000000000000000000.000000000000000000
> ---------------------------------------
>  
> {code}
>  
> The result should be 1040000000000000000000.0000000000000000.
> It appears a partial sum is computed for each join key, as the result returned would
be the answer for all rows matching intNum === 1.
> If only the rows with intNum === 2 are included, the answer given is null:
>  
> {code:java}
> scala> val df3 = df.filter($"intNum" === lit(2))
>  df3: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [decNum: decimal(38,18),
intNum: int]
> scala> val df4 = df3.withColumnRenamed("decNum", "decNum2").join(df3, "intNum").agg(sum("decNum"))
>  df4: org.apache.spark.sql.DataFrame = [sum(decNum): decimal(38,18)]
> scala> df4.show(40,false)
>  -----------
> sum(decNum)
> -----------
> null
> -----------
>  
> {code}
>  
> The correct answer, 1000000000000000000000.0000000000000000, doesn't fit in the DataType
picked for the result, decimal(38,18), so an overflow occurs, which Spark then converts
to null.
> The first example, which doesn't filter out the intNum === 1 values should also return
null, indicating overflow, but it doesn't.  This may mislead the user to think a valid sum
was computed.
> If whole-stage code gen is turned off:
> spark.conf.set("spark.sql.codegen.wholeStage", false)
> ... incorrect results are not returned because the overflow is caught as an exception:
> java.lang.IllegalArgumentException: requirement failed: Decimal precision 39 exceeds
max precision 38
>  
>  
>  
>  
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org


Mime
View raw message