samza-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Julian Hyde <jul...@hydromatic.net>
Subject Re: A question regarding to the default semantic meaning of join
Date Sat, 07 Mar 2015 18:57:52 GMT
First of all, if you want a stream output, you should add the 'STREAM'
keyword after 'SELECT'.

There isn't quite enough information to make the query well-defined. I
need to assume that each stream has an increasing rowtime column, and
that the "OVER (ROWS 3 PRECEDING)" windows on each stream are intended
to be linked on rowtime.

I'd rather that the system doesn't assume anything, and that these
linkages are explicit in either the stream definition or the query,
but let's take it as a given for now.

This query would not be equivalent to join queries with different
windows. Consider the following queries:

(1) semi-infinite x semi-infinite

SELECT STREAM id, value, cost
FROM Orders OVER (ORDER BY rowtime ROWS BETWEEN UNBOUNDED PRECEDING
AND CURRENT ROW)
JOIN Shipments OVER (ORDER BY rowtime ROWS BETWEEN UNBOUNDED PRECEDING
AND CURRENT ROW)
ON Orders.id = Shipments.id

(2) semi-infinite x 1

SELECT STREAM id, value, cost
FROM Orders OVER (ORDER BY rowtime  BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
JOIN Shipments OVER (ORDER BY rowtime ROWS CURRENT ROW)
ON Orders.id = Shipments.id

(3) 3 x 1

SELECT STREAM id, value, cost
FROM Orders OVER (ORDER BY rowtime  ROWS 3 PRECEDING)
JOIN Shipments OVER (ORDER BY rowtime ROWS CURRENT ROW)
ON Orders.id = Shipments.id

At any moment your 3 x 3 query will have up to 9 rows in the cartesian
product, and any given Orders row might match 3 Shipments, and any
given Shipments row might match 3 Orders. So, you get a multiplicative
effect.

And query (1) semi-infinite x semi-infinite could, if id is not unique
in orders and shipments, produce an even bigger multiplier.

For people writing practical queries I would only recommend having a
multi-row window on one side of the join, and/or make sure that their
id column is unique on one side of the join.

To summarize: if we make assumptions about joining on rowtime, then
your query and queries (1), (2), (3) are valid, but due to their
different windows are not equivalent.

Julian

On Fri, Mar 6, 2015 at 4:28 PM, Yi Pan <nickpan47@gmail.com> wrote:
> Hi, Julian,
>
> I am writing down some detailed examples of join and need your further help
> in understanding the semantic meaning of the following example:
>
> SELECT id, value, cost FROM Orders OVER (ROWS 3 PRECEDING) JOIN Shipments
> OVER (ROWS 3 PROCEDING) ON Orders.id = Shipments.id
>
> In this example, if id is not the row number, what would be the default
> semantic meaning of the join? Does it mean that each sliding window in
> Orders now need to join with each sliding window in Shipments? If that's
> the case, that would be equivalent to removing all the windows and just do
> a full stream-to-stream join, which is not possible for infinite streams.
> Hence, does not make sense to define window in the query here. On the other
> hand, if the semantic of the join is to join the "current windows" from
> both streams, how do we align the windows from two streams? Because in this
> case, when the query starts from two streams now determines the results.
>
> Does the above example make sense in SQL? Or should we just declare that
> this is not supported as joins of two streams w/o time-bound?
>
> Thanks!
>
> -Yi

Mime
View raw message