The schema looks like this:
|-- id: long (nullable = true)
|-- timestamp: long (nullable = true)
|-- isValid: boolean (nullable = true)
I need to find the earliest valid record per id. With RDDs I could groupBy 'id' and pick the earliest record per group, but I am not sure how to do this in SQL. Since I am working in PySpark, I cannot really use the Dataset API for this.
One thing I can do is groupBy 'id', find the earliest timestamp per group, and then join back with the original dataframe to get the matching record (with all the metrics).
Or I could pack all the columns into a single struct column, implement a UDAF in Scala, and call it from PySpark.
Neither solution seems straightforward. Is there a simpler way to do this?