calcite-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Fabian Hueske <fhue...@gmail.com>
Subject Re: About Stream SQL
Date Wed, 17 Feb 2016 10:09:52 GMT
Hi,

I agree, the Streaming page is a very good starting point for this
discussion. As suggested by Julian, I created CALCITE-1090 to update the
page such that it reflects the current state of the discussion (adding HOP
and TUMBLE functions, punctuations). I can also help with that, e.g., by
contributing figures, examples, or text, reviewing, or any other way.

>From my point of view, the semantics of the window types and the other
operators in the Streaming document are very good. What is missing are
joins (windowed stream-stream joins, stream-table joins, stream-table joins
with table updates) as already noted in the todo section.

Regarding the handling of late-arriving events, I am not sure if this is a
purely QoS issue as the result of a query might depend on the chosen
strategy. Also users might want to pick different strategies for different
operations, so late-arriver strategies are not necessarily end-to-end but
can be operator specific. However, I think these details should be
discussed in a separate thread.

I'd like to add a few words about the StreamSQL roadmap of the Flink
community.
We are currently preparing our codebase and will start to work on support
for structured queries on streams in the next weeks. Flink will support two
query interface, a SQL interface and a LINQ-style Table API [1]. Both will
be optimized and translated by Calcite. As a first step, we want to add
simple stream transformations such as selection and projection to both
interfaces. Next, we will add windowing support (starting with tumbling and
hopping windows) to the Table API (as is said before, our plans here are
well aligned with Julian's suggestions). Once this is done, we would extend
the SQL interface to support windows which is hopefully as simple as using
a parser that accepts window syntax.

So from our point of view, fixing the semantics of windows and extending
the optimizer accordingly is more urgent than agreeing on a syntax
(although the Table API syntax could be inspired by Calcite's StreamSQL
syntax [2]). I can also help implementing the missing features in Calcite.

Having a reference implementation with tests would be awesome and
definitely help.

Best, Fabian

[1]
https://ci.apache.org/projects/flink/flink-docs-master/apis/batch/libs/table.html
[2]
https://docs.google.com/document/d/19kSOAOINKCSWLBCKRq2WvNtmuaA9o3AyCh2ePqr3V5E

2016-02-14 21:23 GMT+01:00 Julian Hyde <jhyde@apache.org>:

> Fabian,
>
> Apologies for the late reply.
>
> I would rather that the specification for streaming SQL was not too
> prescriptive for how late events were handled. Approaches 1, 2 and 3 are
> all viable, and engines can differentiate themselves by the strength of
> their support for this. But for the SQL to be considered valid, I think the
> validator just needs to know that it can make progress.
>
> There is a large area of functionality I’d call “quality of service”
> (QoS). This includes latency, reliability, at-least-once, at-most-once or
> in-order guarantees, as well as the late-row-handling this thread is
> concerned with. What the QoS metrics have in common is that they are
> end-to-end. To deliver a high QoS to the consumer, the producer needs to
> conform to a high QoS. The QoS is beyond the control of the SQL statement.
> (Although you can ask what a SQL statement is able to deliver, given the
> upstream QoS guarantees.) QoS is best managed by the whole system, and in
> my opinion this is the biggest reason to have a DSMS.
>
> For this reason, I would be inclined to put QoS constraints on the stream
> definition, not on the query. For example, taking latency as the QoS metric
> of interest, you could flag the Orders stream as “at most 10 ms latency
> between the record’s timestamp and the wall-clock time of the server
> receiving the records, and any records arriving after that time are logged
> and discarded”, and that QoS constraint applies to both producers and
> consumers.
>
> Given a query Q ‘select stream * from Orders’, it is valid to ask “what is
> the expected latency of Q?” or tell the planner “produce an implementation
> of Q with a latency of no more than 15 ms, and if you cannot achieve that
> latency, fail”. You could even register Q in the system and tell the system
> to tighten up the latency of any upstream streams and the standing queries
> that populate them. But it’s not valid to say “execute Q with a latency of
> 15 ms”: the system may not be able to achieve it.
>
> In summary: I would allow latency and late-row-handling and other QoS
> annotations in the query but it’s not the most natural or powerful place to
> put them.
>
> Julian
>
>
> > On Feb 6, 2016, at 1:28 AM, Fabian Hueske <fhueske@gmail.com> wrote:
> >
> > Excellent! I missed the punctuations in the todo list.
> >
> > What kind of strategies do you have in mind to handle events that arrive
> > too late? I see
> > 1. dropping of late events
> > 2. computing an updated window result for each late arriving
> > element (implies that the window state is stored for a certain period
> > before it is discarded)
> > 3. computing a delta to the previous window result for each late arriving
> > element (requires window state as well, not applicable to all aggregation
> > types)
> >
> > It would be nice if strategies to handle late-arrivers could be defined
> in
> > the query.
> >
> > I think the plans of the Flink community are quite well aligned with your
> > ideas for SQL on Streams.
> > Should we start by updating / extending the Stream document on the
> Calcite
> > website to include the new window definitions (TUMBLE, HOP) and a
> > discussion of punctuations/watermarks/time bounds?
> >
> > Fabian
> >
> >
> >
> >
> >
> >
> > 2016-02-06 2:35 GMT+01:00 Julian Hyde <jhyde@apache.org>:
> >
> >> Let me rephrase: The *majority* of the literature, of which I cited
> >> just one example, calls them punctuation, and a couple of recent
> >> papers out of Mountain View doesn't change that.
> >>
> >> There are some fine distinctions between punctuation, heartbeats,
> >> watermarks and rowtime bounds, mostly in terms of how they are
> >> generated and propagated, that matter little when planning the query.
> >>
> >> On Fri, Feb 5, 2016 at 5:18 PM, Ted Dunning <ted.dunning@gmail.com>
> wrote:
> >>> On Fri, Feb 5, 2016 at 5:10 PM, Julian Hyde <jhyde@apache.org> wrote:
> >>>
> >>>> Yes, watermarks, absolutely. The "to do" list has "punctuation", which
> >>>> is the same thing. (Actually, I prefer to call it "rowtime bound"
> >>>> because it is feels more like a dynamic constraint than a piece of
> >>>> data, but the literature[1] calls them punctuation.)
> >>>>
> >>>
> >>> Some of the literature calls them punctuation, other literature [1]
> calls
> >>> them watermarks.
> >>>
> >>> [1] http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf
> >>
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message