spark-user mailing list archives

From Tathagata Das <t...@databricks.com>
Subject Re: rest on streaming
Date Wed, 15 Jul 2015 02:19:21 GMT
You can do this by keeping a reference to the latest RDD:

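// Global variables to keep track of the latest batch.
// They need explicit types; @volatile makes updates visible to other threads.
@volatile var latestTime: Time = null
@volatile var latestRDD: RDD[..] = null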


dstream.foreachRDD((rdd: RDD[..], time: Time) => {
    latestTime = time
    latestRDD = rdd
})

Now you can access the latest RDD asynchronously. However, if you are going
to run jobs on the latest RDD, you must tell the streaming subsystem to
keep the necessary data around for longer; otherwise it will be deleted
before the asynchronous query has completed. To do that, call:

streamingContext.remember(<expected max duration of your async query on
latest RDD>)
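
For reference, here is a minimal end-to-end sketch of the pattern above. It
makes several assumptions that are not in the original message: the object
name LatestRddSketch, the helper queryLatest (a stand-in for whatever a REST
handler would call), the (String, Int) element type, the one-minute remember
duration, and a queueStream test input in place of the real stateful DStream.
It also swaps the two global variables for a single AtomicReference so that
the time and the RDD are always updated together.

```scala
import java.util.concurrent.atomic.AtomicReference

import scala.collection.mutable

import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext, Time}

object LatestRddSketch {
  // Most recent (batch time, RDD) pair; None until the first batch has run.
  private val latest =
    new AtomicReference[Option[(Time, RDD[(String, Int)])]](None)

  // Hypothetical entry point that a REST handler could call from its own thread.
  // It runs a Spark job on the latest RDD outside the streaming batch schedule.
  def queryLatest(): Option[(Time, Array[(String, Int)])] =
    latest.get().map { case (time, rdd) => (time, rdd.collect()) }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("latest-rdd-sketch")
    val ssc = new StreamingContext(conf, Seconds(1))

    // Assumed upper bound on how long an asynchronous query may take.
    ssc.remember(Minutes(1))

    // Test input only; a real job would use its own (e.g. stateful) DStream here.
    val queue = mutable.Queue[RDD[(String, Int)]](
      ssc.sparkContext.parallelize(Seq("a" -> 1, "b" -> 2)))
    val stream = ssc.queueStream(queue)

    // Record the newest RDD and its batch time on every batch.
    stream.foreachRDD { (rdd: RDD[(String, Int)], time: Time) =>
      latest.set(Some((time, rdd)))
    }

    ssc.start()
    ssc.awaitTerminationOrTimeout(3000)

    // Simulate one asynchronous request against the latest batch.
    queryLatest().foreach { case (time, rows) =>
      println(s"batch $time has ${rows.length} rows")
    }

    ssc.stop(stopSparkContext = true, stopGracefully = false)
  }
}
```

Because queueStream emits empty RDDs once the queue is drained, the batch
that the final query sees may well be empty; the point of the sketch is only
the hand-off between foreachRDD and the asynchronous caller.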


On Tue, Jul 14, 2015 at 6:57 PM, Chen Song <chen.song.82@gmail.com> wrote:

> I have been working on a POC that adds a REST service to a Spark Streaming
> job. Say I create a stateful DStream X by using updateStateByKey, and each
> time there is an HTTP request, I want to apply some transformations/actions
> to the latest RDD of X and collect the results immediately, rather than on
> the schedule of the streaming batch interval.
>
> * Is that even possible?
> * The reason I ask is that a user can get a list of RDDs with
> DStream.window.slice, but I cannot find a way to get the most recent RDD in
> the DStream.
>
>
> --
> Chen Song
>
>
