spark-dev mailing list archives

From Reynold Xin <r...@databricks.com>
Subject Re: Null Value in DecimalType column of DataFrame
Date Mon, 21 Sep 2015 20:12:32 GMT
+dev list

Hi Dirceu,

Whether throwing an exception or returning null is better depends on your
use case. If you are debugging and want to find bugs in your program, you
might prefer an exception. However, if you are running on a large
real-world dataset (i.e. the data is dirty) and your query can take a
while (e.g. 30 mins), you might prefer the system to just assign null to
the dirty values that would otherwise lead to runtime exceptions, because
otherwise you could spend days just cleaning your data.

Postgres throws exceptions here, but I think that's mainly because it is
used for OLTP, where queries are short-running. Most other analytic
databases, I believe, just return null. The best we can do is provide a
config option to control the exception-handling behavior.
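The two behaviors being weighed here can be sketched in plain Scala (no Spark). The `strict` flag below stands in for the config option suggested above; it is a hypothetical name, not a real Spark setting, and the fit check is only an approximation of what a SQL engine would do:

```scala
import java.math.{BigDecimal => JBigDecimal, RoundingMode}

// Hypothetical sketch of the tradeoff discussed above: when a value does
// not fit DecimalType(precision, scale), either throw (strict = true,
// good for debugging) or return None/null (strict = false, good for
// long-running queries over dirty data).
def castToDecimal(value: String, precision: Int, scale: Int,
                  strict: Boolean): Option[JBigDecimal] = {
  // Round the value to the requested scale, then check it still fits
  // within the requested total number of digits.
  val d = new JBigDecimal(value).setScale(scale, RoundingMode.HALF_UP)
  if (d.precision <= precision) Some(d)
  else if (strict) throw new ArithmeticException(
    s"$value does not fit DecimalType($precision, $scale)")
  else None // lenient mode: dirty data becomes null
}
```

With this sketch, `castToDecimal("0.1", 10, 10, strict = false)` succeeds while `castToDecimal("10.5", 10, 10, strict = false)` yields `None`, matching the behavior reported later in this thread.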


On Fri, Sep 18, 2015 at 8:15 AM, Dirceu Semighini Filho <
dirceu.semighini@gmail.com> wrote:

> Hi Yin, I got that part.
> I just think that instead of returning null, throwing an exception would
> be better. In the exception message we can explain that the DecimalType
> used can't fit the number that is being converted, due to the precision
> and scale values used to create it.
> It would be easier for the user to find the reason why that error is
> happening, instead of getting a NullPointerException in another part of
> his code. We can also improve the documentation of the DecimalType
> classes to explain this behavior. What do you think?
>
>
>
>
> 2015-09-17 18:52 GMT-03:00 Yin Huai <yhuai@databricks.com>:
>
>> As I mentioned before, the range of values of DecimalType(10, 10) is [0,
>> 1). If you have the value 10.5 and you want to cast it to DecimalType(10,
>> 10), I do not think there is any better return value than null. It looks
>> like DecimalType(10, 10) is not the right type for your use case. You
>> need a decimal type with precision - scale >= 2.
>>
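The fit rule stated above can be checked with plain Scala and `java.math.BigDecimal`, without Spark. This is a minimal sketch, not Spark's own cast logic; the idea is simply that a `DecimalType(p, s)` leaves `p - s` digits for the integer part, so 10.5 (two integer digits) needs `p - s >= 2`:

```scala
import java.math.{BigDecimal => JBigDecimal}

// Sketch of the rule above: a value fits DecimalType(precision, scale)
// only if its integer part needs no more than precision - scale digits.
def fits(value: String, precision: Int, scale: Int): Boolean = {
  val d = new JBigDecimal(value)
  // d.precision - d.scale = digits left of the decimal point
  d.precision - d.scale <= precision - scale
}
```

Here `fits("10.5", 10, 10)` is false while `fits("10.5", 12, 10)` and `fits("0.1", 10, 10)` are true, which lines up with the observations reported below.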
>> On Tue, Sep 15, 2015 at 6:39 AM, Dirceu Semighini Filho <
>> dirceu.semighini@gmail.com> wrote:
>>
>>>
>>> Hi Yin, I posted here because I think it's a bug.
>>> So it will return null, and I can get a NullPointerException, as I was
>>> getting. Is this really the expected behavior? I have never seen
>>> anything return null in the other Scala tools I have used.
>>>
>>> Regards,
>>>
>>>
>>> 2015-09-14 18:54 GMT-03:00 Yin Huai <yhuai@databricks.com>:
>>>
>>>> btw, moving this to the user list.
>>>>
>>>> On Mon, Sep 14, 2015 at 2:54 PM, Yin Huai <yhuai@databricks.com> wrote:
>>>>
>>>>> A scale of 10 means that there are 10 digits to the right of the
>>>>> decimal point. If you also have precision 10, the range of your data
>>>>> will be [0, 1), and casting "10.5" to DecimalType(10, 10) will return
>>>>> null, which is expected.
>>>>>
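The range implied above follows from how precision and scale interact: the largest value a `DecimalType(p, s)` can hold is `(10^p - 1) / 10^s`. A plain-Scala sketch (not Spark's own `Decimal` code) makes this concrete:

```scala
import java.math.{BigDecimal => JBigDecimal, BigInteger}

// Largest representable value of a DecimalType(precision, scale):
// precision significant digits, scale of them after the decimal point,
// i.e. (10^precision - 1) scaled down by 10^scale.
def maxValue(precision: Int, scale: Int): JBigDecimal =
  new JBigDecimal(BigInteger.TEN.pow(precision).subtract(BigInteger.ONE), scale)
```

For example, `maxValue(10, 10)` is 0.9999999999 (just below 1, hence the [0, 1) range above), while `maxValue(12, 10)` is 99.9999999999, which is why DecimalType(12, 10) can hold 10.5.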
>>>>> On Mon, Sep 14, 2015 at 1:42 PM, Dirceu Semighini Filho <
>>>>> dirceu.semighini@gmail.com> wrote:
>>>>>
>>>>>> Hi all,
>>>>>> I'm moving from spark 1.4 to 1.5, and one of my tests is failing.
>>>>>> It seems that there was some changes in org.apache.spark.sql.types.
>>>>>> DecimalType
>>>>>>
>>>>>> This ugly code is a little sample to reproduce the error; don't use
>>>>>> it in your project.
>>>>>>
>>>>>> test("spark test") {
>>>>>>   val file = context.sparkContext().textFile(s"${defaultFilePath}Test.csv")
>>>>>>     .map { f =>
>>>>>>       val values = f.split(",")
>>>>>>       Row.fromSeq(Seq(values(0).toInt, values(1).toInt,
>>>>>>         BigDecimal(values(2)), values(3)))
>>>>>>     }
>>>>>>
>>>>>>   val structType = StructType(Seq(
>>>>>>     StructField("id", IntegerType, false),
>>>>>>     StructField("int2", IntegerType, false),
>>>>>>     StructField("double", DecimalType(10, 10), false),
>>>>>>     StructField("str2", StringType, false)))
>>>>>>
>>>>>>   val df = context.sqlContext.createDataFrame(file, structType)
>>>>>>   df.first
>>>>>> }
>>>>>>
>>>>>> The content of the file is:
>>>>>>
>>>>>> 1,5,10.5,va
>>>>>> 2,1,0.1,vb
>>>>>> 3,8,10.0,vc
>>>>>>
>>>>>> The problem resides in DecimalType: before 1.5 the scale wasn't
>>>>>> required. Now, using DecimalType(12,10) works fine, but with
>>>>>> DecimalType(10,10) the decimal value 10.5 becomes null, while 0.1
>>>>>> still works.
>>>>>>
>>>>>> Is there anybody working with DecimalType for 1.5.1?
>>>>>>
>>>>>> Regards,
>>>>>> Dirceu
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>>
>>
>
