spark-issues mailing list archives

From "Sean Owen (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (SPARK-9936) decimal precision lost when loading DataFrame from RDD
Date Fri, 14 Aug 2015 10:14:45 GMT

     [ https://issues.apache.org/jira/browse/SPARK-9936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated SPARK-9936:
-----------------------------
    Assignee: Liang-Chi Hsieh

> decimal precision lost when loading DataFrame from RDD
> ------------------------------------------------------
>
>                 Key: SPARK-9936
>                 URL: https://issues.apache.org/jira/browse/SPARK-9936
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.4.0
>            Reporter: Tzach Zohar
>            Assignee: Liang-Chi Hsieh
>             Fix For: 1.5.0
>
>
> It seems that when converting an RDD that contains BigDecimals into a DataFrame (using SQLContext.createDataFrame without specifying a schema), precision info is lost, which means saving as a Parquet file will fail (Parquet tries to verify that precision is < 18, so it fails if precision is unset).
> This seems to be similar to [SPARK-7196|https://issues.apache.org/jira/browse/SPARK-7196], which fixed the same issue for DataFrames created via JDBC.
> To reproduce:
> {code:none}
> scala> val rdd: RDD[(String, BigDecimal)] = sc.parallelize(Seq(("a", BigDecimal.valueOf(0.234))))
> rdd: org.apache.spark.rdd.RDD[(String, BigDecimal)] = ParallelCollectionRDD[0] at parallelize at <console>:23
> scala> val df: DataFrame = new SQLContext(rdd.context).createDataFrame(rdd)
> df: org.apache.spark.sql.DataFrame = [_1: string, _2: decimal(10,0)]
> scala> df.write.parquet("/data/parquet-file")
> 15/08/13 10:30:07 ERROR executor.Executor: Exception in task 0.0 in stage 0.0 (TID 0)
> java.lang.RuntimeException: Unsupported datatype DecimalType()
> {code}
> To verify that this is indeed caused by the precision being lost, I've tried manually changing the schema to include precision (by traversing the StructFields and replacing the DecimalTypes with altered DecimalTypes) and creating a new DataFrame using this updated schema - and indeed this fixes the problem.
> I'm using Spark 1.4.0.
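
A minimal sketch of the workaround described above, assuming Spark 1.4 APIs. The helper name withExplicitDecimalPrecision and the chosen precision/scale (18, 8) are illustrative and not taken from the report; pick values that fit the actual data.

{code:scala}
import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.types.{DecimalType, StructField, StructType}

// Rebuild a DataFrame with its DecimalType fields replaced by a DecimalType that
// carries an explicit precision and scale, so the Parquet writer sees a bounded decimal.
def withExplicitDecimalPrecision(df: DataFrame, sqlContext: SQLContext): DataFrame = {
  val patchedFields = df.schema.fields.map {
    case StructField(name, _: DecimalType, nullable, metadata) =>
      // Illustrative precision/scale, kept within Parquet's supported range.
      StructField(name, DecimalType(18, 8), nullable, metadata)
    case other => other
  }
  // Reuse the original rows under the patched schema.
  sqlContext.createDataFrame(df.rdd, StructType(patchedFields))
}

// Usage: withExplicitDecimalPrecision(df, sqlContext).write.parquet("/data/parquet-file")
{code}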



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

