Untested, but something like the below should work:

from pyspark.sql import functions as F
from pyspark.sql import Window as W

(record
 # rank each id's rows by timestamp, earliest first
 .withColumn('ts_rank', F.dense_rank().over(W.partitionBy('id').orderBy('timestamp')))
 # keep only the earliest row(s) per id
 .filter(F.col('ts_rank') == 1)
 .drop('ts_rank')
)
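One caveat: since you want the earliest *valid* record, you would probably add a .filter(F.col('isValid')) before the window; and if several rows can tie on the same timestamp, row_number() instead of dense_rank() would keep exactly one row per id.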


On Mon, Dec 17, 2018 at 4:04 PM Nikhil Goyal <nownikhil@gmail.com> wrote:
Hi guys,

I have a dataframe of type Record (id: Long, timestamp: Long, isValid: Boolean, .... other metrics)

Schema looks like this:
root
 |-- id: long (nullable = true)
 |-- timestamp: long (nullable = true)
 |-- isValid: boolean (nullable = true)
.....

I need to find the earliest valid record per id. In the RDD world I could do a groupBy on 'id' and find the earliest one, but I am not sure how to do it in SQL. Since I am doing this in PySpark I cannot really use the Dataset API for this.
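For reference, the RDD version I have in mind is roughly this (just a sketch, using the column names from the schema above):

from pyspark.sql import functions as F

# group rows by id and keep the one with the smallest timestamp
earliest_per_id = (record.rdd
                   .filter(lambda r: r.isValid)
                   .groupBy(lambda r: r.id)
                   .mapValues(lambda rows: min(rows, key=lambda r: r.timestamp))
                   .values())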

One thing I can do is groupBy 'id', find the earliest timestamp available, and then join with the original dataframe to get the right record (all the metrics).
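Something along these lines (again just a sketch; ties on (id, timestamp) would still return multiple rows):

from pyspark.sql import functions as F

# earliest valid timestamp per id
earliest = (record
            .filter(F.col('isValid'))
            .groupBy('id')
            .agg(F.min('timestamp').alias('timestamp')))

# join back to recover the full record (all the metrics)
result = record.join(earliest, on=['id', 'timestamp'], how='inner')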

Or I can combine all the columns into a single column, then implement a UDAF in Scala and use it from PySpark.

Neither solution seems straightforward. Is there a simpler way to do this?

Thanks
Nikhil