spark-user mailing list archives

From Tathagata Das <tathagata.das1...@gmail.com>
Subject Re: How to select the entire row that has max timestamp for every key in Spark Structured Streaming 2.1.1?
Date Tue, 29 Aug 2017 20:38:17 GMT
Say *trainTimesDataset* is the streaming Dataset of schema *[train: Int,
dest: String, time: Timestamp]*.


*Scala*: *trainTimesDataset.groupBy("train", "dest").max("time")*


*SQL*: *"select train, dest, max(time) from trainTimesView group by train,
dest"*    // after calling
*trainTimesDataset.createOrReplaceTempView("trainTimesView")*
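One subtlety worth noting: groupBy(...).max("time") returns only the grouping
columns plus max(time), not the entire original row, whereas the question asks
for the whole row with the max timestamp per train. The intended per-group
semantics can be sketched with plain Scala collections (no Spark involved; the
TrainTime case class, the helper name, and the sample data are made up here to
mirror the thread's dataset):

```scala
// Hypothetical in-memory stand-in for the streaming Dataset's rows.
// Times are "HH:mm" strings, which compare correctly lexicographically.
case class TrainTime(train: Int, dest: String, time: String)

object MaxTimePerTrain {
  // For each train, keep the WHOLE row with the latest time --
  // unlike an aggregate, which would return only (train, max(time)).
  def latestPerTrain(rows: Seq[TrainTime]): Map[Int, TrainTime] =
    rows.groupBy(_.train).map { case (train, group) =>
      train -> group.maxBy(_.time)
    }

  def main(args: Array[String]): Unit = {
    val rows = Seq(
      TrainTime(1, "HK", "10:00"), TrainTime(1, "SH", "12:00"), TrainTime(1, "SZ", "14:00"),
      TrainTime(2, "HK", "13:00"), TrainTime(2, "SH", "09:00"), TrainTime(2, "SZ", "07:00")
    )
    latestPerTrain(rows).toSeq.sortBy(_._1).foreach { case (_, r) =>
      println(s"${r.train}\t${r.dest}\t${r.time}")
    }
  }
}
```

In Spark terms, getting the full row back would typically require joining the
aggregated max back to the source, which is exactly the nested-query shape the
original question hoped to avoid.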


On Tue, Aug 29, 2017 at 12:59 PM, kant kodali <kanth909@gmail.com> wrote:

> Hi All,
>
> I am wondering what is the easiest and most concise way to express the
> computation below in Spark Structured Streaming, given that it supports
> both imperative and declarative styles.
> I am just trying to select the rows that have the max timestamp for each
> train. Instead of writing some sort of nested query, as we normally would
> in a relational database, I am trying to see if I can leverage both
> imperative and declarative styles at the same time. If nested queries or
> joins are not required, I would like to see how that is possible. I am
> using Spark 2.1.1.
>
> Dataset
>
> Train    Dest    Time
> 1        HK      10:00
> 1        SH      12:00
> 1        SZ      14:00
> 2        HK      13:00
> 2        SH      09:00
> 2        SZ      07:00
>
> The desired result should be:
>
> Train    Dest    Time
> 1        SZ      14:00
> 2        HK      13:00
>
>
