samza-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Julian Hyde <jul...@hydromatic.net>
Subject Re: Window spec in SQL language vs Samza system details
Date Tue, 10 Feb 2015 21:24:56 GMT
The answer depends on your design philosophy. We need to strike a balance between making it
possible and making it easy. Because SQL is a powerful closed language, we can achieve a lot
by combining the elements. For example, I think that your example can be solved by joining
a "heartbeat" stream to the recent history of the trades stream:

 SELECT STREAM t.*
  FROM (
    SELECT ticker, count(*) 
    FROM Trades OVER (RANGE INTERVAL '1' HOUR PRECEDING)
    GROUP BY ticker) AS t
 JOIN (STREAM Heartbeat(INTERVAL '11' MINUTE))

Note: "Heartbeat" is a user-defined stream function; OVER in the FROM clause to convert a
stream into a relation that contains its history over a particular period relative to the
"current time" of the query.
 
I seems natural to use a heartbeat because of the "input - output cardinality principle":
* If it's one row in, one row out, use windowed aggregation (agg OVER window in the SELECT
clause). 
* If it's several rows in, one row out, use streaming GROUP BY (on time plus possibly other
keys)
* If it's possible that the system produces output when there are no rows in, you need to
use a heartbeat stream

Now, if you take the "make it possible" philosophy to its limit, there is a danger that you
end up with SQL that is so complicated that end users don't understand it, and/or is so complicated
that the planner cannot recognize the pattern and convert it to your nice efficient physical
operator.

That is the point where you very carefully introduce SQL extensions.

But I strongly suggest creating a "cookbook". Give each stream-processing pattern a name and
illustrate using a simple example. Show the SQL that can achieve that pattern. It doesn't
matter if the SQL is a bit gnarly. Users will grok it, and adapt the SQL for their applications.
And they will compose it using union, join, group by... to create new patterns you hadn't
thought of.

If you extend SQL without sufficient thought, you might break composability, and that is a
huge problem. Or you end up writing the planner so that it produces the right plan when it
sees the query in its sugared version but not when expressed using the fundamentals (case
in point: if we had introduced a "tumbling window" syntax, someone could still express it
using the JOIN query above); a lesser problem, but still undermines your users' trust in the
system to do the right thing.

Julian


> On Feb 9, 2015, at 11:46 AM, Yi Pan <nickpan47@gmail.com> wrote:
> 
> Hi, Julian and all,
> 
> We had a discussion in LinkedIn last week regarding to the window spec in
> SQL language on top of Samza systems. There are some issues in the window
> spec that I want to discuss:
> 
> Consider that we want to have a count of stock trades (as a infinite
> stream) happened in the last hour, but only every 11min. It is easy to
> write the first part as:
>   SELECT STREAM rowtime, count(*) OVER (ORDER BY rowtime RANGE INTERVAL
> '1' HOUR PROCEDING)
>       FROM Trades
>   The above will create a stream of counts that happened every hour
> continuously as rows are scanned.
> 
> Now here is the question:
>   a. how do we have the count every 11min instead of as the row comes in?
> As we discussed before, there are examples that we can create by doing
> truncating / grouping on the rowtime to "sample" the continuous moving
> counting window to get a count every 11min. But that has two issues:
>      - From implementation point of view, there is no efficiency
> improvement since the system still computes the count for each and every
> row comes in
>      - If Samza implement a more efficient tumbling window operator, there
> is no easy way to identify the section of SQL statement that can map to the
> more efficient tumbling window operator, as the sampling is done via math /
> group-by aggregation instead of window spec
>   b. if there is no row in Trades between 12:00pm to 2:00pm, how do we
> tell the system to still generate 0 counts for the time moments: 12:11pm,
> 12:22pm, 12:33pm, etc.? Or, those rows are delayed in the delivery in the
> system and user wants to ignore late-arrival of messages after 5min timeout
> to close the counting window? How can we support that use case w/o breaking
> SQL grammar?
> 
> Both the above issues seem to require some extension to the window spec in
> SQL grammar. Julian, what do you think? Is it creating too many
> language/parser/planner problems in SQL?


Mime
View raw message