spark-user mailing list archives

From Alexander Peletz <alexand...@slalom.com>
Subject RE: pyspark.sql.functions.last not working as expected
Date Thu, 18 Aug 2016 00:47:44 GMT
Here is the test case from the commit that added the first/last methods: https://github.com/apache/spark/pull/10957/commits/defcc02a8885e884d5140b11705b764a51753162

  test("last/first with ignoreNulls") {
    val nullStr: String = null
    val df = Seq(
      ("a", 0, nullStr),
      ("a", 1, "x"),
      ("a", 2, "y"),
      ("a", 3, "z"),
      ("a", 4, nullStr),
      ("b", 1, nullStr),
      ("b", 2, nullStr)).
      toDF("key", "order", "value")
    val window = Window.partitionBy($"key").orderBy($"order")
    checkAnswer(
      df.select(
        $"key",
        $"order",
        first($"value").over(window),
        first($"value", ignoreNulls = false).over(window),
        first($"value", ignoreNulls = true).over(window),
        last($"value").over(window),
        last($"value", ignoreNulls = false).over(window),
        last($"value", ignoreNulls = true).over(window)),
      Seq(
        Row("a", 0, null, null, null, null, null, null),
        Row("a", 1, null, null, "x", "x", "x", "x"),
        Row("a", 2, null, null, "x", "y", "y", "y"),
        Row("a", 3, null, null, "x", "z", "z", "z"),
        Row("a", 4, null, null, "x", null, null, "z"),
        Row("b", 1, null, null, null, null, null, null),
        Row("b", 2, null, null, null, null, null, null)))
  }

I would expect the correct results to be as follows, instead of what is asserted above. Shouldn't
we always return the first or last value in the partition, based on the ordering? It looks like
something else is going on; can someone explain?

      Seq(
        Row("a", 0, null, null, "x", null, null, "z"),
        Row("a", 1, null, null, "x", null, null, "z"),
        Row("a", 2, null, null, "x", null, null, "z"),
        Row("a", 3, null, null, "x", null, null, "z"),
        Row("a", 4, null, null, "x", null, null, "z"),
        Row("b", 1, null, null, null, null, null, null),
        Row("b", 2, null, null, null, null, null, null)))
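For reference, the gap between the two sets of rows can be sketched without Spark. One reading that is consistent with the rows the test asserts is that each aggregate is evaluated over a frame ending at the current row, rather than over the whole partition. This is a plain-Python sketch with hypothetical helper names; the frame interpretation is my assumption, not something stated in the thread:

```python
def running_last(values, ignore_nulls):
    """Last value over a frame from the partition start up to the
    current row (inclusive), optionally skipping nulls -- this
    reproduces the last(...) columns the test above asserts."""
    out = []
    for i in range(len(values)):
        frame = values[: i + 1]                     # frame ends at current row
        if ignore_nulls:
            frame = [v for v in frame if v is not None]
        out.append(frame[-1] if frame else None)
    return out

def partition_last(values, ignore_nulls):
    """Last value over the entire partition -- the whole-partition
    reading the expected rows above assume."""
    frame = [v for v in values if v is not None] if ignore_nulls else values
    return [frame[-1] if frame else None for _ in values]

vals = [None, "x", "y", "z", None]                  # partition "a", ordered by "order"
print(running_last(vals, ignore_nulls=True))        # [None, 'x', 'y', 'z', 'z']
print(partition_last(vals, ignore_nulls=True))      # ['z', 'z', 'z', 'z', 'z']
```

The first line matches the test's last($"value", ignoreNulls = true) column for partition "a"; the second matches the expected rows above.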



From: Alexander Peletz [mailto:alexanderp@slalom.com]
Sent: Wednesday, August 17, 2016 11:57 AM
To: user <user@spark.apache.org>
Subject: pyspark.sql.functions.last not working as expected

Hi,

I am using Spark 2.0 and I am getting unexpected results from the last() method. Has anyone
else experienced this? I get the sense that last() works correctly within a given data
partition but not across the entire RDD. first() seems to work as expected, so I can work around
this by using a window in reverse order with first() instead of last(), but it would be great
if last() actually worked.
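
The reverse-order workaround described above can likewise be sketched in plain Python (hypothetical helper name, not the Spark API): with the window ordered descending, each row's frame runs from the end of the partition back to the current row, so first() picks the latest value at or after that row:

```python
def first_over_desc(values, ignore_nulls=True):
    """Simulate first(...) over a window ordered DESCENDING with a
    frame ending at the current row: for the row at index i (in the
    original ascending order), the frame holds rows i..end, scanned
    from the partition's end toward the current row."""
    out = []
    for i in range(len(values)):
        frame = list(reversed(values[i:]))          # the DESC-ordered frame
        if ignore_nulls:
            frame = [v for v in frame if v is not None]
        out.append(frame[0] if frame else None)
    return out

vals = [None, "x", "y", "z", None]
print(first_over_desc(vals))                        # ['z', 'z', 'z', 'z', None]
```

Note that under this sketch a row followed only by nulls still gets null, so the workaround matches the whole-partition last non-null value only for rows that have a later non-null value.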


Thanks,
Alexander


Alexander Peletz
Consultant

slalom

Fortune 100 Best Companies to Work For 2016
Glassdoor Best Places to Work 2016
Consulting Magazine Best Firms to Work For 2015

316 Stuart Street, Suite 300
Boston, MA 02116
706.614.5033 cell | 617.316.5400 office
alexanderp@slalom.com<mailto:alexanderp@slalom.com>

