spark-user mailing list archives

From Tathagata Das <t...@databricks.com>
Subject Re: [Spark Streaming] Session based windowing like in google dataflow
Date Sat, 08 Aug 2015 01:28:36 GMT
You can use Spark Streaming's updateStateByKey to do arbitrary
sessionization.
See the example -
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/StatefulNetworkWordCount.scala
All it does is store a single number (the count of each word seen since the
beginning), but you can extend it to store arbitrary state data. You can
then use that state to keep track of gaps, windows, etc.
People have done a lot of sessionization using this. I am sure others can
chime in.
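To make the "track gaps in state" idea concrete, here is a minimal sketch of the
per-key update function you would hand to updateStateByKey, written as plain
Python (PySpark's DStream.updateStateByKey takes a function of this shape). The
names update_session, GAP, and the (session_start, last_seen) state tuple are
illustrative choices for this sketch, not Spark API:

```python
GAP = 30 * 60  # assumed session gap: 30 minutes of inactivity, in seconds

def update_session(new_timestamps, state):
    """Per-key state update, shaped like an updateStateByKey callback.

    new_timestamps: event times (seconds) for one key in this batch.
    state: (session_start, last_seen) from the previous batch, or None
           for a key not seen before.
    Returns the updated (session_start, last_seen); an inter-event gap
    larger than GAP starts a new session window.
    """
    if not new_timestamps:
        return state  # no new events for this key; carry the state forward

    ts = sorted(new_timestamps)
    if state is None:
        session_start, last_seen = ts[0], ts[0]
    else:
        session_start, last_seen = state

    for t in ts:
        if t - last_seen > GAP:
            session_start = t  # gap exceeded: a new session begins here
        last_seen = t
    return (session_start, last_seen)
```

A real job would also need to decide when to emit and expire closed sessions
(e.g. by returning None once last_seen is older than the gap), since
updateStateByKey otherwise keeps state for every key forever.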

On Fri, Aug 7, 2015 at 10:48 AM, Ankur Chauhan <ankur@malloc64.com> wrote:

> Hi all,
>
> I am trying to figure out how to perform the equivalent of "session windows"
> (as mentioned in https://cloud.google.com/dataflow/model/windowing) using
> Spark Streaming. Is it even possible (i.e. possible to do efficiently at
> scale)? Just to expand on the definition:
>
> Taken from the google dataflow documentation:
>
> The simplest kind of session windowing specifies a minimum gap duration.
> All data arriving below a minimum threshold of time delay is grouped into
> the same window. If data arrives after the minimum specified gap duration
> time, this initiates the start of a new window.
>
> Any help would be appreciated.
>
> -- Ankur Chauhan
>
