samza-dev mailing list archives

From Julian Hyde <jul...@hydromatic.net>
Subject Re: Windowing Guarantees in samza
Date Sun, 15 Feb 2015 21:19:57 GMT
+1

As far as possible, behavior should be deterministic, that is, determined by the data rather
than when the query was started or the arrival time of the data. 

Of course, for the query to make progress, there should be ways to discard late data and to
indicate that a producer is alive but doesn't have any data to send for a particular time
period.  But for normal operation, a slight change in record arrival time or relative order
of records from different producers should not radically change the output. 
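
To make "determined by the data" concrete, here is a minimal sketch in plain Java (not Samza API). It assigns a record to a window using only the record's own timestamp, and treats a record as late only relative to a watermark. The names WindowAlignment, windowStart, isLate and the watermark parameter are hypothetical, introduced just for illustration.

```java
import java.time.Duration;
import java.time.Instant;

class WindowAlignment {
    // Start of the tumbling window that contains the given event time.
    // Depends only on the record's timestamp, never on the wall clock.
    static Instant windowStart(Instant eventTime, Duration size) {
        long sizeMs = size.toMillis();
        long t = eventTime.toEpochMilli();
        return Instant.ofEpochMilli(Math.floorDiv(t, sizeMs) * sizeMs);
    }

    // Assumption: a record is "late" when the end of its window is at or
    // before the watermark, i.e. that window has already been closed.
    static boolean isLate(Instant eventTime, Instant watermark, Duration size) {
        Instant windowEnd = windowStart(eventTime, size).plus(size);
        return !windowEnd.isAfter(watermark);
    }

    public static void main(String[] args) {
        Duration hour = Duration.ofHours(1);
        Instant t = Instant.parse("2015-02-15T21:19:57Z");
        System.out.println(windowStart(t, hour)); // 2015-02-15T21:00:00Z
    }
}
```

Because the window is a pure function of the timestamp, two records with the same timestamps land in the same windows regardless of the order in which they arrive, as long as neither is late.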

We've been having discussions about SQL support for rolling, paged and tumbling windows. We'll
be able to trigger emission of rows at the top of the hour, or at other intervals, based on
the timestamp of the data. Punctuation will allow timely emission even if there is no data
flowing. 
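
A rough sketch of how hourly emission driven by data timestamps plus punctuation could look, again in plain Java rather than the Samza or SQL API. HourlyCounter, onRecord and onPunctuation are hypothetical names. It assumes a single producer, so the watermark is simply the largest timestamp seen so far; with several producers you would more likely take the minimum over the per-producer timestamps.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class HourlyCounter {
    static final long HOUR_MS = 3_600_000L;

    // Open windows, keyed by window start (epoch millis) -> record count.
    private final TreeMap<Long, Long> counts = new TreeMap<>();
    // Assumption: single producer, watermark = max timestamp seen so far.
    private long watermark = Long.MIN_VALUE;

    // Data record: count it in the window its timestamp falls in,
    // unless that window has already been closed (late data is dropped).
    List<String> onRecord(long eventTimeMs) {
        long start = Math.floorDiv(eventTimeMs, HOUR_MS) * HOUR_MS;
        if (start + HOUR_MS <= watermark) {
            return new ArrayList<>();
        }
        counts.merge(start, 1L, Long::sum);
        return advance(eventTimeMs);
    }

    // Punctuation: "I am alive and have nothing earlier than timeMs".
    // Carries no data but can still close windows.
    List<String> onPunctuation(long timeMs) {
        return advance(timeMs);
    }

    // Move the watermark forward and emit every window that ends at or before it.
    private List<String> advance(long timeMs) {
        watermark = Math.max(watermark, timeMs);
        List<String> out = new ArrayList<>();
        while (!counts.isEmpty() && counts.firstKey() + HOUR_MS <= watermark) {
            Map.Entry<Long, Long> e = counts.pollFirstEntry();
            out.add(e.getKey() + "=" + e.getValue());
        }
        return out;
    }

    public static void main(String[] args) {
        HourlyCounter c = new HourlyCounter();
        c.onRecord(1_000L);
        c.onRecord(2_000L);
        // No more data arrives; punctuation at the top of the hour closes the window.
        System.out.println(c.onPunctuation(HOUR_MS)); // [0=2]
    }
}
```

The punctuation here plays the same role as the timestamped control message Ben describes below: the window closes on the hour by data time even when the producer has gone quiet.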

Julian 

> On Feb 15, 2015, at 10:51, Benjamin Edwards <edwards.benj@gmail.com> wrote:
> 
> Hi
> 
> Based on what I can see in the run loop class, there are a few things that
> seem a little problematic for windowed processing with respect to time:
> 
> 1) No ability to schedule *when* on an interval you might start. For
> instance, if you wanted to process a window on the hour, every hour, there
> is no way to do this.
> 
> 2) You don't get passed the time. I guess this is simply due to the fact
> that the window isn't really trying to keep up, or pin itself to a given
> phase. If you get behind, well tough. You just added some phase to your
> series.
> 
> What do people normally do to mitigate this? I was thinking that rather
> than using the Windowed task I would simply have the producer use a timer
> and once per period send a control message with the timestamp. This would
> indicate to my task that the period was up and state should be flushed to db,
> aggregated to another stream, etc.
> 
> Note that I am not trying to do real time processing with hard constraints,
> or anything like that, I just need things that mostly happened within a
> given frame to get grouped and most importantly for things to happen "on
> the minute" or "on the hour" etc.
> 
> Ben
