spark-dev mailing list archives

From Ted Yu <yuzhih...@gmail.com>
Subject Re: [sql] Dataframe how to check null values
Date Mon, 20 Apr 2015 12:24:55 GMT
I found:
https://issues.apache.org/jira/browse/SPARK-6573



> On Apr 20, 2015, at 4:29 AM, Peter Rudenko <petro.rudenko@gmail.com> wrote:
> 
> Sounds very good. Is there a jira for this? It would be cool to have in 1.4, because currently
> we cannot use the dataframe.describe function with NaN values; we have to manually filter all the columns.
> 
> Thanks,
> Peter Rudenko
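A minimal sketch of the manual filtering Peter describes, assuming every column of the hypothetical DataFrame df is numeric and a Spark version whose org.apache.spark.sql.functions exposes isnan (added in later releases):

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions._

    // Drop any row that contains NaN in any column, then describe the rest.
    // Assumes all columns are numeric; isnan would fail to resolve otherwise.
    def describeIgnoringNaN(df: DataFrame): DataFrame = {
      val noNaN = df.columns.foldLeft(df)((cur, c) => cur.filter(!isnan(col(c))))
      noNaN.describe()
    }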
> 
>> On 2015-04-02 21:18, Reynold Xin wrote:
>> Incidentally, we were discussing this yesterday. Here are some thoughts on null handling
>> in SQL/DataFrames. Would be great to get some feedback.
>> 
>> 1. Treat floating point NaN and null as the same "null" value. This would be consistent
>> with most SQL databases, and Pandas. This would also require some inbound conversion.
>> 
>> 2. Internally, when we see a NaN value, we should mark the null bit as true and
>> keep the NaN value. When we see a null value for a floating point field, we should
>> mark the null bit as true and update the field to store NaN.
>> 
>> 3. Externally, for floating point values, return NaN when the value is null.
>> 
>> 4. For all other types, return null for null values.
>> 
>> 5. For UDFs, if the argument is a primitive type (i.e. it does not handle null) and
>> not a floating point field, simply evaluate the expression to null. This is consistent with
>> most SQL UDFs and most programming languages' treatment of NaN.
>> 
>> 
>> Any thoughts on these semantics?
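A minimal sketch of what rules 1-3 would mean for a single double field, written as plain Scala rather than Spark internals; the Boolean here merely stands in for the internal null bit the proposal mentions:

    // Rules 1-2 (inbound): NaN and null collapse into one "null" state -
    // the null bit is set and the slot stores NaN.
    def writeDouble(value: java.lang.Double): (Boolean, Double) =
      if (value == null || value.isNaN) (true, Double.NaN)
      else (false, value.doubleValue)

    // Rule 3 (outbound): a null floating point field reads back as NaN.
    def readDouble(nullBit: Boolean, slot: Double): Double =
      if (nullBit) Double.NaN else slot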
>> 
>> 
>> On Thu, Apr 2, 2015 at 5:51 AM, Dean Wampler <deanwampler@gmail.com> wrote:
>> 
>>    I'm afraid you're a little stuck. In Scala, the types Int, Long, Float,
>>    Double, Byte, and Boolean look like reference types in source code, but
>>    they are compiled to the corresponding JVM primitive types, which can't
>>    be null. That's why you get the warning about ==.
>> 
>>    Your best choice might be to use NaN as the placeholder for null, then
>>    create one DF using a filter that removes those values. Use that DF to
>>    compute the mean. Then apply a map step to the original DF to translate
>>    the NaNs to the mean.
>> 
>>    dean
>> 
>>    Dean Wampler, Ph.D.
>>    Author: Programming Scala, 2nd Edition
>>    <http://shop.oreilly.com/product/0636920033073.do> (O'Reilly)
>>    Typesafe <http://typesafe.com>
>>    @deanwampler <http://twitter.com/deanwampler>
>>    http://polyglotprogramming.com
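A hedged sketch of Dean's filter-then-map suggestion, assuming the isnan and when helpers from org.apache.spark.sql.functions (added in later Spark releases), a hypothetical DataFrame df, and a hypothetical double column "d":

    import org.apache.spark.sql.functions._

    // Mean over a DF with the NaN placeholders filtered out...
    val mean = df.filter(!isnan(col("d"))).agg(avg("d")).first().getDouble(0)

    // ...then map each NaN in the original DF to that mean.
    val imputed = df.withColumn("d", when(isnan(col("d")), mean).otherwise(col("d")))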
>> 
>>    On Thu, Apr 2, 2015 at 7:54 AM, Peter Rudenko <petro.rudenko@gmail.com> wrote:
>> 
>>    > Hi, I need to implement a MeanImputor - impute missing values with the
>>    > mean. If I set missing values to null, then dataframe aggregation works
>>    > properly, but in a UDF it treats null values as 0.0. Here’s an example:
>>    >
>>    > val df = sc.parallelize(Array(1.0, 2.0, null, 3.0, 5.0, null)).toDF
>>    > df.agg(avg("_1")).first  // res45: org.apache.spark.sql.Row = [2.75]
>>    > df.withColumn("d2", callUDF({(value: Double) => value}, DoubleType, df("d"))).show()
>>    >
>>    >   d     d2
>>    >   1.0   1.0
>>    >   2.0   2.0
>>    >   null  0.0
>>    >   3.0   3.0
>>    >   5.0   5.0
>>    >   null  0.0
>>    >
>>    > val df = sc.parallelize(Array(1.0, 2.0, Double.NaN, 3.0, 5.0, Double.NaN)).toDF
>>    > df.agg(avg("_1")).first  // res46: org.apache.spark.sql.Row = [Double.NaN]
>>    >
>>    > In a UDF I cannot compare Scala’s Double to null:
>>    >
>>    > comparing values of types Double and Null using `==' will always yield false
>>    > [warn] if (value == null) meanValue else value
>>    >
>>    > With Double.NaN instead of null I can compare in the UDF, but
>>    > aggregation doesn’t work properly. Maybe it’s related to:
>>    > https://issues.apache.org/jira/browse/SPARK-6573
>>    >
>>    > Thanks,
>>    > Peter Rudenko
>>    >
>>    >
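One hedged workaround for the Double-vs-null comparison above: declare the UDF argument as java.lang.Double, a reference type, so null reaches the function intact and == is legal. This assumes Spark's udf helper accepts the boxed type; meanValue is a hypothetical stand-in for the mean computed elsewhere.

    import org.apache.spark.sql.functions.udf

    val meanValue = 2.75  // hypothetical stand-in for the computed mean

    // The boxed type keeps null alive inside the UDF, so it can be
    // tested with == before unboxing back to a primitive Double.
    val imputeUdf = udf { (v: java.lang.Double) =>
      if (v == null) meanValue else v.doubleValue
    }

    // e.g. df.withColumn("d2", imputeUdf(df("d")))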
> 


