spark-issues mailing list archives

From "Miquel (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-26233) Incorrect decimal value with java beans and first/last/max... functions
Date Mon, 03 Dec 2018 09:20:00 GMT

    [ https://issues.apache.org/jira/browse/SPARK-26233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16706867#comment-16706867 ]

Miquel commented on SPARK-26233:
--------------------------------

Hi [~hyukjin.kwon], sure, I can reproduce it using Rows. It happens when the declared scale of the
Decimal column doesn't match the real scale of the values. It's worse with a Java bean because it's
not possible to change the Decimal scale of the encoder.
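To make the mismatch concrete, here is a minimal sketch (my reading of the situation, not taken from the original report): the value's own scale is 4, while the column declared below is decimal(38, 8).
{code:java}
import java.math.BigDecimal;

// BigDecimal.valueOf(double) keeps the scale of the shortest decimal
// representation, so 0.1111 carries scale 4 ...
BigDecimal v = BigDecimal.valueOf(0.1111);
System.out.println(v.scale());  // prints 4
// ... while the column is declared as DecimalType(38, 8): the scales differ.
{code}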

Using *first*, the result is wrong:
{code:java}
import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder;
import org.apache.spark.sql.catalyst.encoders.RowEncoder;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import static org.apache.spark.sql.functions.*;

// Declare "var" as decimal(38, 8); the stored values have a real scale of 4.
List<StructField> fields = new ArrayList<>();
fields.add(DataTypes.createStructField("group", DataTypes.StringType, true));
fields.add(DataTypes.createStructField("var", DataTypes.createDecimalType(38, 8), true));
ExpressionEncoder<Row> encoder = RowEncoder.apply(DataTypes.createStructType(fields));

// BigDecimal.valueOf(l + 0.1111) produces values with scale 4, not 8.
Dataset<Row> ds = spark.range(5)
    .map(l -> RowFactory.create("" + l, BigDecimal.valueOf(l + 0.1111)), encoder);
ds.show();

+-----+------+
|group|   var|
+-----+------+
|    0|0.1111|
|    1|1.1111|
|    2|2.1111|
|    3|3.1111|
|    4|4.1111|
+-----+------+

ds.groupBy(col("group"))
    .agg(
        first(col("var"))
    )
    .show();

+-----+-----------------+
|group|first(var, false)|
+-----+-----------------+
|    3|       0.00031111|
|    0|       0.00001111|
|    1|       0.00011111|
|    4|       0.00041111|
|    2|       0.00021111|
+-----+-----------------+
{code}
The wrong values are consistent with the unscaled digits being reinterpreted at the declared scale: 3.1111 has unscaled value 31111 at scale 4, and 31111 read back at scale 8 is 0.00031111. But it works fine again if we use *sum*:
{code:java}
ds.groupBy(col("group"))
    .agg(
        sum(col("var"))
    )
    .show();

+-----+----------+
|group|  sum(var)|
+-----+----------+
|    3|3.11110000|
|    0|0.11110000|
|    1|1.11110000|
|    4|4.11110000|
|    2|2.11110000|
+-----+----------+
{code}
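
If the scale mismatch is indeed the trigger, normalizing the BigDecimal to the declared scale should avoid the problem. A minimal sketch, untested assumption on my part:
{code:java}
// Hypothetical workaround: force the value's scale to match the declared
// DecimalType(38, 8) before encoding (raising the scale never rounds).
Dataset<Row> fixed = spark.range(5)
    .map(l -> RowFactory.create("" + l,
        BigDecimal.valueOf(l + 0.1111).setScale(8)), encoder);
fixed.groupBy(col("group")).agg(first(col("var"))).show();
{code}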

> Incorrect decimal value with java beans and first/last/max... functions
> -----------------------------------------------------------------------
>
>                 Key: SPARK-26233
>                 URL: https://issues.apache.org/jira/browse/SPARK-26233
>             Project: Spark
>          Issue Type: Bug
>          Components: Java API
>    Affects Versions: 2.3.1, 2.4.0
>            Reporter: Miquel
>            Priority: Minor
>
> Decimal values from Java beans are incorrectly scaled when used with functions like first/last/max...
> This problem arises because Encoders.bean always sets Decimal values as _DecimalType(this.MAX_PRECISION(), 18)_.
> Usually it's not a problem if you use numeric functions like *sum*, but for functions like *first*/*last*/*max*... it is a problem.
> How to reproduce this error:
> Using this class as an example:
> {code:java}
> import java.io.Serializable;
> import java.math.BigDecimal;
>
> public class Foo implements Serializable {
>   private String group;
>   private BigDecimal var;
>
>   public BigDecimal getVar() {
>     return var;
>   }
>   public void setVar(BigDecimal var) {
>     this.var = var;
>   }
>   public String getGroup() {
>     return group;
>   }
>   public void setGroup(String group) {
>     this.group = group;
>   }
> }
> {code}
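> As a quick check, the schema the bean encoder derives can be printed directly. A sketch (assuming the Spark 2.4 Encoder API); it shows the BigDecimal field mapped to decimal(38,18) regardless of the values' real scale:
> {code:java}
> // Sketch: print the schema Encoders.bean derives for Foo.
> Encoders.bean(Foo.class).schema().printTreeString();
> {code}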
>  
> And a dummy code to create some objects:
> {code:java}
> Dataset<Foo> ds = spark.range(5)
>     .map(l -> {
>       Foo foo = new Foo();
>       foo.setGroup("" + l);
>       foo.setVar(BigDecimal.valueOf(l + 0.1111));
>       return foo;
>     }, Encoders.bean(Foo.class));
> ds.printSchema();
> ds.show();
> +-----+------+
> |group|   var|
> +-----+------+
> |    0|0.1111|
> |    1|1.1111|
> |    2|2.1111|
> |    3|3.1111|
> |    4|4.1111|
> +-----+------+
> {code}
> We can see that the DecimalType has precision 38 and scale 18, and all values are shown correctly.
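> For reference, the schema printed by ds.printSchema() should look like this (reconstructed, not copied from a run):
> {code}
> root
>  |-- group: string (nullable = true)
>  |-- var: decimal(38,18) (nullable = true)
> {code}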
> But if we use the *first* function, the values are scaled incorrectly:
> {code:java}
> ds.groupBy(col("group"))
>     .agg(
>         first("var")
>     )
>     .show();
> +-----+-----------------+
> |group|first(var, false)|
> +-----+-----------------+
> |    3|       3.1111E-14|
> |    0|        1.111E-15|
> |    1|       1.1111E-14|
> |    4|       4.1111E-14|
> |    2|       2.1111E-14|
> +-----+-----------------+
> {code}
> This incorrect behavior cannot be reproduced if we use "numerical" functions like *sum*, or if the column is cast to a new DecimalType:
> {code:java}
> ds.groupBy(col("group"))
>     .agg(
>         sum("var")
>     )
>     .show();
> +-----+--------------------+
> |group|            sum(var)|
> +-----+--------------------+
> |    3|3.111100000000000000|
> |    0|0.111100000000000000|
> |    1|1.111100000000000000|
> |    4|4.111100000000000000|
> |    2|2.111100000000000000|
> +-----+--------------------+
> ds.groupBy(col("group"))
>     .agg(
>         first(col("var").cast(new DecimalType(38, 8)))
>     )
>     .show();
> +-----+----------------------------------------+
> |group|first(CAST(var AS DECIMAL(38,8)), false)|
> +-----+----------------------------------------+
> |    3|                              3.11110000|
> |    0|                              0.11110000|
> |    1|                              1.11110000|
> |    4|                              4.11110000|
> |    2|                              2.11110000|
> +-----+----------------------------------------+
> {code}


