spark-user mailing list archives

From Nikhil Goyal <nownik...@gmail.com>
Subject Need help with SparkSQL Query
Date Mon, 17 Dec 2018 21:03:31 GMT
Hi guys,

I have a dataframe of type Record (id: Long, timestamp: Long, isValid:
Boolean, .... other metrics)

Schema looks like this:
root
 |-- id: long (nullable = true)
 |-- timestamp: long (nullable = true)
 |-- isValid: boolean (nullable = true)
.....

I need to find the earliest valid record per id. In the RDD world I could
groupBy 'id' and pick the earliest one, but I am not sure how to do it in
SQL. Since I am working in PySpark, I cannot really use the Dataset API for
this.

One option is to groupBy 'id', find the earliest timestamp per group, and
then join back with the original dataframe to recover the full record (all
the metrics).

Alternatively, I could pack all the columns into a single struct column,
implement a UDAF in Scala, and call it from PySpark.

Neither solution seems straightforward. Is there a simpler way to do this?

Thanks
Nikhil
