spark-dev mailing list archives

From Peter Rudenko <petro.rude...@gmail.com>
Subject [sql] Dataframe how to check null values
Date Thu, 02 Apr 2015 11:54:00 GMT
Hi, I need to implement a MeanImputor, i.e. impute missing values with the mean. If
I set missing values to null, then DataFrame aggregation works properly,
but in a UDF the null values are treated as 0.0. Here's an example:

    val df = sc.parallelize(Array(1.0, 2.0, null, 3.0, 5.0, null)).toDF
    df.agg(avg("_1")).first //res45: org.apache.spark.sql.Row = [2.75]
    df.withColumn("d2", callUDF({ (value: Double) => value }, DoubleType, df("d"))).show()

    d    d2
    1.0  1.0
    2.0  2.0
    null 0.0
    3.0  3.0
    5.0  5.0
    null 0.0

    val df = sc.parallelize(Array(1.0, 2.0, Double.NaN, 3.0, 5.0, Double.NaN)).toDF
    df.agg(avg("_1")).first //res46: org.apache.spark.sql.Row = [Double.NaN]

In the UDF I cannot compare Scala's Double to null:

    comparing values of types Double and Null using `==' will always yield false
    [warn] if (value==null) meanValue else value
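
One workaround I'm considering is to declare the UDF parameter as a boxed
java.lang.Double, so null is not unboxed to 0.0 and can be tested directly.
A minimal sketch (the column name "d" and the precomputed meanValue are just
for illustration; I haven't verified how this interacts with Spark's internal
converters):

    import org.apache.spark.sql.functions.{avg, callUDF}
    import org.apache.spark.sql.types.DoubleType

    // avg skips nulls, so this is the mean of the non-missing values (2.75 above).
    val meanValue: java.lang.Double = df.agg(avg("d")).first.getDouble(0)

    // A boxed java.lang.Double keeps null distinguishable from 0.0.
    val impute = (value: java.lang.Double) =>
      if (value == null) meanValue else value

    df.withColumn("d2", callUDF(impute, DoubleType, df("d"))).show()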

With Double.NaN instead of null I can do the comparison in the UDF, but then
aggregation doesn't work properly. Maybe it's related to
https://issues.apache.org/jira/browse/SPARK-6573
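
Another option might be to avoid the UDF altogether: keep the missing values
as null so that avg ignores them, then replace them via DataFrameNaFunctions.
A sketch (assuming df.na.fill(value, cols), which I believe is available from
Spark 1.3.1, and a nullable Double column named "d"):

    import org.apache.spark.sql.functions.avg

    // Mean of the present values only; nulls are skipped by avg.
    val meanValue = df.agg(avg("d")).first.getDouble(0)

    // Replace nulls in "d" with the mean (whether NaN is also replaced
    // here depends on the Spark version).
    val imputed = df.na.fill(meanValue, Seq("d"))
    imputed.show()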

Thanks,
Peter Rudenko

