calcite-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Fabian Hueske <fhue...@gmail.com>
Subject Re: About Stream SQL
Date Fri, 05 Feb 2016 23:46:43 GMT
Hi,

first of all, thanks for starting this discussion. As Stephan said before,
the Flink community is working towards support for SQL on streams. IMO, it
would be very nice if the different efforts for SQL on streams could
converge towards a core set of semantics and syntax.

I read the proposal on the Calcite website [1] and the mail thread about
the TUMBLE and HOP functions [2]. IMO, the discussed windowing semantics
are well defined and could serve as a common basis. The improved syntax for
tumbling and hopping windows is very nice (would be good to update the
Calcite website with this). More concise definitions for sliding and
cascading windows would be great, too.

One concern that I have is the requirement of a monotonic attribute. Such
an attribute might be present in use cases where events are generated at a
central source, such as transactions of a database system. However, events
that originate from distributed sources such as sensors, log events of
cluster machines, etc, do not arrive in timestamp order at a stream
processor. In fact, the majority of use cases from Flink users has to deal
with out-of-order events. Flink (and Google Cloud Dataflow / Apache Beam
incubating) use the notion of event time and watermarks [3][4] to handle
streams with out-of-order events. While watermarks can help a lot to
improve the consistency of results, it is not possible to guarantee that no
late events arrive at an operator. There are different strategies to deal
with late arriving events (dropping, recomputation of aggregates, ...), but
usually they have a negative effect on the result correctness and/or a
downstream systems needs to deal with them, e.g., updating previous results.

Relaxing the requirement for monotonic attributes to attributes with
watermarks (watermarks are provided by the data source), would have no
impact on the corrects of queries over tables with an ordered attribute
(since there would be no late arriving events). It would also not change
the semantics of queries over streams with a monotonic attribute, but would
also allow for streams with (slightly) out-of-order arriving events at the
cost that query results may become inconsistent in case of late arriving
results.

I agree that defining use cases and queries over some sample data would be
a good way to start.

Best, Fabian

[1] http://calcite.apache.org/docs/stream.html
[2]
http://mail-archives.apache.org/mod_mbox/calcite-dev/201506.mbox/%3CCAPSgeETbowxM2TRX0RFxQ_tEAPk2uM=hE0aryWinBtovGwbddQ@mail.gmail.com%3E
<javascript:void(0)>
[3]
http://data-artisans.com/how-apache-flink-enables-new-streaming-applications-part-1/
[4]
https://ci.apache.org/projects/flink/flink-docs-master/apis/streaming/index.html#working-with-time

2016-02-05 11:29 GMT+01:00 Julian Hyde <jhyde@apache.org>:

> Stephan,
>
> I agree that we have a long way to go to get to a standard, but I
> think that means that we should start as soon as possible. By which I
> mean, rather than going ahead and creating Flink-specific extensions,
> let's have discussions about SQL extensions in a broad forum, pulling
> in members of other projects. Streaming/CEP is not a new field, so the
> use cases are well known.
>
> It's true that Calcite's streaming SQL doesn't go far beyond standard
> SQL. I don't want to diverge too far; and besides, one can accomplish
> a lot with each feature added to the language. What is needed are a
> very few well chosen deep features; then we can add liberal syntactic
> sugar on top of these to to make common use cases more concise.
>
> For the record, HOP and TUMBLE described in [1] are not OLAP sliding
> windows; they go in the GROUP BY clause. But unlike GROUP BY, they
> allow each row to contribute to more than one sub-total. This is novel
> in SQL (only GROUPING SETS allows this, and in a limited form) and
> could be the basis for user-defined windows.
>
> Also, our sliding windows can be defined by row count as well as by
> time. For example, suppose you want to calculate the length of a
> sport's team's streak of consecutive wins or losses. You can partition
> by N, where N is the number of state changes of the win-loss variable,
> and so each switch from win-to-lose or lose-to-win starts a new
> window.
>
> As you define your extensions to SQL, I strongly suggest that you make
> explicit, as columns, any data values that drive behavior. These might
> include event times, arrival times, processing times, flush
> directives, and any required orderings. If information is not
> explicit, we can not reason about the query as algebra, and the
> planner cannot more radical plans, such as sorting (within a window)
> or changing how the rows are partitioned across parallel processors. A
> litmus test is whether a database can apply the same SQL to the data
> archived from the stream and to achieve the same results.
>
> I saw that Fabian blogged recently[2] about stream windows in Flink.
> Could we start the process by trying to convert some use cases
> (expressed in English, and with sample input and output data) into
> SQL? Then we can iterate to make the SQL concise, understandable, and
> well-defined.
>
> Julian
>
> [1]
> http://mail-archives.apache.org/mod_mbox/calcite-dev/201506.mbox/%3CCAPSgeETbowxM2TRX0RFxQ_tEAPk2uM=hE0aryWinBtovGwbddQ@mail.gmail.com%3E
>
> [2] https://flink.apache.org/news/2015/12/04/Introducing-windows.html
>
> On Thu, Feb 4, 2016 at 1:35 AM, Stephan Ewen <sewen@apache.org> wrote:
> > Hi!
> >
> > True, the Flink community is looking into stream SQL, and is currently
> > building on top of Calcite. This is all going well, but we probably need
> > some custom syntax around windowing.
> >
> > For Stream SQL Windowing, what I have seen so far in Calcite (correct me
> if
> > I am wrong there), is pretty much a variant of the OLAP sliding window
> > aggregates.
> >
> >   - Windows are in those basically calculated by rounding down/up
> > timestamps, thus bucketizing the events. That works for many cases, but
> is
> > quite tricky syntax.
> >
> >   - Flink supports various notions of time for windowing (processing
> time,
> > ingestion time, event time), as well as triggers. To be able to extend
> the
> > window specification with such additional parameters is pretty crucial
> and
> > would probably go well with a dedicated window clause.
> >
> >   - Flink also has unaligned windows (sessions, timeouts, ...) which are
> > very hard to map to grouping and window aggregations across ordered
> groups.
> >
> >
> > Converging to a core standard around stream SQL is very desirable, I
> > completely agree.
> > For the basic constructs, I think this is quite feasible and Calcite has
> > some good suggestions there.
> >
> > In the advanced constructs, the systems differ quite heavily currently,
> so
> > converging there may be harder there. Also, we are just learning what
> > semantics people need concerning windowing/event time/etc. May almost be
> a
> > tad bit too early to try and define a standard there...
> >
> >
> > Greetings,
> > Stephan
> >
> >
> > On Thu, Feb 4, 2016 at 9:35 AM, Julian Hyde <jhyde@apache.org> wrote:
> >
> >> I totally agree with you. (Sorry for the delayed response; this week has
> >> been very busy.)
> >>
> >> There is a tendency of vendors (and projects) to think that their
> >> technology is unique, and superior to everyone else’s, and want to
> showcase
> >> it in their dialect of SQL. That is natural, and it’s OK, since it makes
> >> them strive to make their technology better.
> >>
> >> However, they have to remember that the end users don’t want something
> >> unique, they want something that solves their problem. They would like
> >> something that is standards compliant so that it is easy to learn, easy
> to
> >> hire developers for, and — if the worst comes to the worst — easy to
> >> migrate to a compatible competing technology.
> >>
> >> I know the developers at Storm and Flink (and Samza too) and they
> >> understand the importance of collaborating on a standard.
> >>
> >> I have been trying to play a dual role: supplying the parser and planner
> >> for streaming SQL, and also to facilitate the creation of a standard
> >> language and semantics of streaming SQL. For the latter, see Streaming
> page
> >> on Calcite’s web site[1]. On that page, I intend to illustrate all of
> the
> >> main patterns of streaming queries, give them names (e.g. “Tumbling
> >> windows”), and show how those translate into streaming SQL.
> >>
> >> Also, it would be useful to create a reference implementation of
> streaming
> >> SQL in Calcite so that you can validate and run queries. The
> performance,
> >> scalability and reliability will not be the same as if you ran Storm,
> Flink
> >> or Samza, but at least you can see what the semantics should be.
> >>
> >> I believe that most, if not all, of the examples that the projects are
> >> coming up with can be translated into SQL. It will be challenging,
> because
> >> we want to preserve the semantics of SQL, allow streaming SQL to
> >> interoperate with traditional relations, and also retain the general
> look
> >> and feel of SQL. (For example, I fought quite hard[2] recently for the
> >> principle that GROUP BY defines a partition (in the set-theory sense)[3]
> >> and therefore could not be used to represent a tumbling window, until I
> >> remembered that GROUPING SETS already allows each input row to appear in
> >> more than one output sub-total.)
> >>
> >> What can you, the users, do? Get involved in the discussion about what
> you
> >> want in the language. Encourage the projects to bring their proposed SQL
> >> features into this forum for discussion, and add to the list of patterns
> >> and examples on the Streaming page. As in any standards process, the
> users
> >> help to keep the vendors focused.
> >>
> >> I’ll be talking about streaming SQL, planning, and standardization at
> the
> >> Samza meetup in 2 weeks[4], so if any of you are in the Bay Area, please
> >> stop by.
> >>
> >> Julian
> >>
> >> [1] http://calcite.apache.org/docs/stream.html
> >>
> >> [2]
> >>
> http://mail-archives.apache.org/mod_mbox/calcite-dev/201506.mbox/%3CCAPSgeETbowxM2TRX0RFxQ_tEAPk2uM=hE0aryWinBtovGwbddQ@mail.gmail.com%3E
> >>
> >> [3] https://en.wikipedia.org/wiki/Partition_of_a_set
> >>
> >> [4] http://www.meetup.com/Bay-Area-Samza-Meetup/events/228430492/
> >>
> >> > On Jan 29, 2016, at 10:29 PM, Wanglan (Lan) <lan.wanglan@huawei.com>
> >> wrote:
> >> >
> >> > Hi to all,
> >> >
> >> > I am from Huawei and am focusing on data stream processing.
> >> > Recently I noticed that both in Storm community and Flink community
> >> there are endeavors to user Calcite as SQL parser to enable Storm/Flink
> to
> >> support SQL. They both want to supplemented or clarify Streaming SQL of
> >> calcite, especially the definition of windows.
> >> > I am considering if both communities working on designing Stream SQL
> >> syntax separately, there would come out two different syntaxes which
> >> represent the same use case.
> >> > Therefore, I am wondering if it is possible to unify such work, i.e.
> >> design and compliment the calcite Streaming SQL to enrich window
> definition
> >> so that both storm and flink can reuse the calcite(Streaming SQL) as
> their
> >> SQL parser for streaming cases with little change.
> >> > What do you think about this idea?
> >> >
> >>
> >>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message