spark-user mailing list archives

From Cody Koeninger <c...@koeninger.org>
Subject Re: reduceByKeyAndWindow, but using log timestamps instead of clock seconds
Date Thu, 29 Jan 2015 15:11:47 GMT
I've implemented message-timestamp-based windowing on Kafka streams.  I
dealt with the empty-interval problem Tobias mentioned by just blocking
until the next later timestamp was available, then assuming empty
intervals for any intervening windows.
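
Roughly, the per-partition bookkeeping looks like this (a simplified
sketch, not my actual code; the names and types are illustrative):

// Sketch only: assumes messages within a partition arrive in
// timestamp order; windowSize is in the same units as the timestamps.
case class Message(ts: Long, payload: String)

class WindowTracker(windowSize: Long) {
  private var windowStart: Long = -1L
  private var buffer = Vector.empty[Message]

  // Returns windows (including empty intervening ones) that a new,
  // later timestamp proves can no longer receive data.
  def offer(msg: Message): Seq[(Long, Seq[Message])] = {
    if (windowStart < 0) windowStart = (msg.ts / windowSize) * windowSize
    val closed = Vector.newBuilder[(Long, Seq[Message])]
    while (msg.ts >= windowStart + windowSize) {
      closed += ((windowStart, buffer))   // may be an empty interval
      buffer = Vector.empty
      windowStart += windowSize
    }
    buffer = buffer :+ msg
    closed.result()
  }
}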

This does require that messages be in order per Kafka partition,
though.  I think that to deal with out-of-order messages you need to be
doing an aggregation that is monoidal.
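
By monoidal I mean an aggregation with an associative merge and an
identity element, so partial results for a window can be combined no
matter what order the messages show up in.  A toy example:

// Toy example: count and sum form a commutative monoid, so late or
// out-of-order messages fold into a window's aggregate in any order
// with the same result.
case class Agg(count: Long, sum: Long) {
  def merge(other: Agg): Agg = Agg(count + other.count, sum + other.sum)
}

object Agg {
  val empty = Agg(0L, 0L)             // identity element
  def one(v: Long): Agg = Agg(1L, v)  // lift a single message
}

// e.g. values.foldLeft(Agg.empty)((acc, v) => acc.merge(Agg.one(v)))

Something order-dependent, like "most recent value", won't combine
cleanly this way.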

At any rate, I think better handling of the distinction between current
time and the time embedded in the data stream is the single biggest
improvement Spark Streaming needs.

On Thu, Jan 29, 2015 at 1:49 AM, Shao, Saisai <saisai.shao@intel.com> wrote:

> That's definitely a good supplement to the current Spark Streaming; I've
> heard from many people who want to process their data using log time.
> Looking forward to the code.
>
> Thanks
> Jerry
>
> -----Original Message-----
> From: Tathagata Das [mailto:tathagata.das1565@gmail.com]
> Sent: Thursday, January 29, 2015 10:33 AM
> To: Tobias Pfeiffer
> Cc: YaoPau; user
> Subject: Re: reduceByKeyAndWindow, but using log timestamps instead of
> clock seconds
>
> Ohhh nice! It would be great if you could share some code with us soon. It
> is indeed a very complicated problem, and there is probably no single
> solution that fits all use cases, so having one concrete way of doing
> things would be a great reference. Looking forward to that!
>
> On Wed, Jan 28, 2015 at 4:52 PM, Tobias Pfeiffer <tgp@preferred.jp> wrote:
> > Hi,
> >
> > On Thu, Jan 29, 2015 at 1:54 AM, YaoPau <jonrgregg@gmail.com> wrote:
> >>
> >> My thinking is to maintain state in an RDD and update and persist
> >> it with each 2-second pass, but this also seems like it could get
> >> messy.  Any thoughts or examples that might help me?
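> >>
> >> For concreteness, the rough shape I had in mind (very much a sketch;
> >> sc and stream are assumed to be set up elsewhere, and the counting
> >> logic is made up for illustration):
> >>
> >> import org.apache.spark.SparkContext
> >> import org.apache.spark.SparkContext._
> >> import org.apache.spark.rdd.RDD
> >> import org.apache.spark.streaming.dstream.DStream
> >>
> >> // stream: DStream[(String, Long)] arriving in 2-second batches
> >> def run(sc: SparkContext, stream: DStream[(String, Long)]): Unit = {
> >>   var state: RDD[(String, Long)] = sc.emptyRDD
> >>
> >>   stream.foreachRDD { batch =>
> >>     val updated = state
> >>       .fullOuterJoin(batch.reduceByKey(_ + _))
> >>       .mapValues { case (old, add) =>
> >>         old.getOrElse(0L) + add.getOrElse(0L)
> >>       }
> >>     updated.persist()  // cache for the next pass
> >>     state.unpersist()  // drop the previous generation
> >>     state = updated    // lineage keeps growing -- hence the messiness
> >>   }
> >> }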
> >
> >
> > I have just implemented some timestamp-based windowing on DStreams
> > (I can't share the code now, but it will be published in a couple of
> > months), although with the assumption that items arrive in order.
> > The main challenge (rather technical) was to keep proper state across
> > RDD boundaries and to tell the state "you can mark this partial window
> > from the last interval as 'complete' now" without shuffling too much
> > data around. For example, if there are some empty intervals, you don't
> > know when the next item to go into the partial window will arrive, or
> > if there will be one at all. I guess if you want to have out-of-order
> > tolerance, that will become even trickier, as you need to define and
> > think about some timeout for partial windows in your state...
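> >
> > The general shape is something like the following (illustrative only,
> > not the actual code -- the names and the state type are made up):
> >
> > import org.apache.spark.streaming.dstream.DStream
> >
> > // Key each item by the start of the timestamp window it falls into,
> > // then keep partial windows in Spark's state across intervals.
> > case class Partial(items: Seq[String], complete: Boolean)
> >
> > def windowByTimestamp(events: DStream[(Long, String)], // (ts, payload)
> >                       windowSize: Long): DStream[(Long, Partial)] =
> >   events
> >     .map { case (ts, payload) =>
> >       ((ts / windowSize) * windowSize, payload)
> >     }
> >     .updateStateByKey[Partial] { (newItems, state) =>
> >       val old = state.getOrElse(Partial(Seq.empty, complete = false))
> >       // The hard part: deciding when to flip complete to true (and
> >       // eventually drop the state) without shuffling too much data.
> >       Some(old.copy(items = old.items ++ newItems))
> >     }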
> >
> > Tobias
> >
> >
>
