samza-dev mailing list archives

From Benjamin Edwards <edwards.b...@gmail.com>
Subject Re: Windowing Guarantees in samza
Date Mon, 16 Feb 2015 21:40:51 GMT
Thanks for the responses, much appreciated. I will continue to experiment.

Ben

On Sun Feb 15 2015 at 21:22:17 Julian Hyde <julian@hydromatic.net> wrote:

> +1
>
> As far as possible, behavior should be deterministic, that is, determined
> by the data rather than when the query was started or the arrival time of
> the data.
>
> Of course, for the query to make progress, there should be ways to discard
> late data and to indicate that a producer is alive but doesn't have any
> data to send for a particular time period.  But for normal operation, a
> slight change in record arrival time or relative order of records from
> different producers should not radically change the output.
>
> We've been having discussions about SQL support for rolling, paged and
> tumbling windows. We'll be able to trigger emission of rows at the top of
> the hour, based on the time stamp of the data, and other intervals.
> Punctuation will allow timely emission even if there is no data flowing.
>
> Julian
>
> > On Feb 15, 2015, at 10:51, Benjamin Edwards <edwards.benj@gmail.com> wrote:
> >
> > Hi
> >
> > Based on what I can see in the run loop class, there are a few things that
> > seem a little problematic for windowed processing with respect to time:
> >
> > 1) No ability to schedule *when* on an interval you might start. For
> > instance, if you wanted to process a window on the hour, every hour, there
> > is no way to do this.
> >
> > 2) You don't get passed the time. I guess this is simply due to the fact
> > that the window isn't really trying to keep up, or pin itself to a given
> > phase. If you get behind, well tough. You just added some phase to your
> > series.
> >
> > What do people normally do to mitigate this? I was thinking that rather
> > than using the Windowed task I would simply have the producer use a timer
> > and once a period send a control message with the time stamp. This would
> > indicate to my task that the period was up and that state should be flushed
> > to the db, aggregated to another stream, etc.
> >
> > Note that I am not trying to do real time processing with hard constraints,
> > or anything like that, I just need things that mostly happened within a
> > given frame to get grouped and most importantly for things to happen "on
> > the minute" or "on the hour" etc.
> >
> > Ben
>
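
The two ideas in this thread (windows determined by the time stamp of the
data, and a control message or punctuation that closes a period) can be
sketched roughly as follows. This is only an illustration in plain Python, not
Samza code: the names window_start, TumblingCounter, on_record and
on_punctuation are hypothetical, time stamps are assumed to be epoch seconds,
and the one-hour window size is an assumption.

```python
from collections import defaultdict

WINDOW_SECONDS = 3600  # assumed window size: one hour


def window_start(ts, size=WINDOW_SECONDS):
    """Align an epoch-seconds time stamp to the start of its window."""
    return ts - (ts % size)


class TumblingCounter:
    """Counts records per tumbling window, keyed by the record's own time stamp.

    Hypothetical sketch: windows are closed only by punctuation (a control
    message carrying a time stamp), never by wall-clock arrival time.
    """

    def __init__(self, size=WINDOW_SECONDS):
        self.size = size
        self.counts = defaultdict(int)  # window start -> record count
        self.closed_before = None       # windows starting before this are closed

    def on_record(self, ts):
        """Add a record; return False if it is late and was discarded."""
        start = window_start(ts, self.size)
        if self.closed_before is not None and start < self.closed_before:
            return False  # its window was already emitted, so drop it
        self.counts[start] += 1
        return True

    def on_punctuation(self, ts):
        """Close every window that ends at or before ts and return its result.

        Works even when no data has arrived since the last punctuation.
        """
        boundary = window_start(ts, self.size)
        ready = sorted(w for w in self.counts if w < boundary)
        emitted = [(w, self.counts.pop(w)) for w in ready]
        if self.closed_before is None or boundary > self.closed_before:
            self.closed_before = boundary
        return emitted


if __name__ == "__main__":
    counter = TumblingCounter()
    for ts in (3600, 10, 3599):  # arrival order does not affect the result
        counter.on_record(ts)
    print(counter.on_punctuation(3600))  # closes the window starting at 0
    print(counter.on_record(100))        # late record for a closed window
    print(counter.on_punctuation(7200))  # closes the window starting at 3600
```

Because a record is assigned to a window by its own time stamp, a slight change
in arrival order gives the same counts, and the producer's control message
decides when results appear "on the hour". Late records are simply dropped in
this sketch; a real job might route them elsewhere instead.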
