calcite-dev mailing list archives

From Julian Hyde <jh...@apache.org>
Subject Re: About Stream SQL
Date Fri, 19 Feb 2016 23:11:43 GMT
I gave a talk about streaming SQL at a Samza meetup. A lot of it is about the semantics of
streaming SQL, and I cover some ground that I don’t cover in the streams page[1].

The news item[2] gets you to both slides and video.

In other news, I notice[3] that Spark 2.1 will contain “continuous SQL”. If the examples[4]
are accurate, all queries are heavily based on sliding windows, and they use a syntax for
those windows that is very different to standard SQL.  I think we can deal with their use
cases, and in my opinion our proposed syntax is more elegant and closer to the standard. But
we should discuss. I don’t want to diverge from other efforts because of hubris/ignorance.

At the Samza meetup some folks mentioned the use case of a stream that summarizes, emitting
periodic totals even if there were no data in a given period. Can they re-state that use case
here, so we can discuss?
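
To make the use case above concrete while we wait for it to be restated, here is a minimal sketch in Python (not Calcite code). It assumes integer timestamps, fixed-size periods, and a hypothetical helper named periodic_totals; the point is only that a total is emitted for every period, including periods in which no rows arrived.

```python
from collections import defaultdict


def periodic_totals(events, period, start, end):
    """Return (period_start, total) for every period in [start, end).

    events is an iterable of (timestamp, amount) pairs. Periods with no
    events still appear in the output, with a total of 0.
    """
    totals = defaultdict(int)
    for ts, amount in events:
        if start <= ts < end:
            period_start = start + ((ts - start) // period) * period
            totals[period_start] += amount
    # Walk the periods themselves, not the data, so empty periods are emitted.
    return [(p, totals.get(p, 0)) for p in range(start, end, period)]


events = [(1, 5), (3, 2), (25, 7)]
print(periodic_totals(events, period=10, start=0, end=30))
# [(0, 7), (10, 0), (20, 7)]
```

The design choice that matters is that the output is driven by the sequence of periods rather than by the arriving rows, which is what lets the empty period starting at 10 show up at all.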

Julian

[1] http://calcite.apache.org/docs/stream.html

[2] http://calcite.apache.org/news/2016/02/17/streaming-sql-talk/

[3] http://www.slideshare.net/databricks/the-future-of-realtime-in-spark-58433411 slide 29

[4] https://issues.apache.org/jira/secure/attachment/12775265/StreamingDataFrameProposal.pdf


> On Feb 17, 2016, at 12:09 PM, Milinda Pathirage <mpathira@umail.iu.edu> wrote:
> 
> Hi Fabian,
> 
> We did some work on stream joins [1]. I tested stream-to-relation joins
> with Samza, but not stream-to-stream joins. We never updated the streaming
> documentation, so I'll send a pull request with some documentation on joins.
> 
> Thanks
> Milinda
> 
> [1] https://issues.apache.org/jira/browse/CALCITE-968
> 
> On Wed, Feb 17, 2016 at 5:09 AM, Fabian Hueske <fhueske@gmail.com> wrote:
> 
>> Hi,
>> 
>> I agree, the Streaming page is a very good starting point for this
>> discussion. As suggested by Julian, I created CALCITE-1090 to update the
>> page such that it reflects the current state of the discussion (adding HOP
>> and TUMBLE functions, punctuations). I can also help with that, e.g., by
>> contributing figures, examples, or text, reviewing, or any other way.
>> 
>> From my point of view, the semantics of the window types and the other
>> operators in the Streaming document are very good. What is missing are
>> joins (windowed stream-stream joins, stream-table joins, stream-table joins
>> with table updates) as already noted in the todo section.
>> 
>> Regarding the handling of late-arriving events, I am not sure if this is a
>> purely QoS issue as the result of a query might depend on the chosen
>> strategy. Also users might want to pick different strategies for different
>> operations, so late-arriver strategies are not necessarily end-to-end but
>> can be operator specific. However, I think these details should be
>> discussed in a separate thread.
>> 
>> I'd like to add a few words about the StreamSQL roadmap of the Flink
>> community.
>> We are currently preparing our codebase and will start to work on support
>> for structured queries on streams in the next weeks. Flink will support two
>> query interfaces, a SQL interface and a LINQ-style Table API [1]. Both will
>> be optimized and translated by Calcite. As a first step, we want to add
>> simple stream transformations such as selection and projection to both
>> interfaces. Next, we will add windowing support (starting with tumbling and
>> hopping windows) to the Table API (as I said before, our plans here are
>> well aligned with Julian's suggestions). Once this is done, we would extend
>> the SQL interface to support windows which is hopefully as simple as using
>> a parser that accepts window syntax.
>> 
>> So from our point of view, fixing the semantics of windows and extending
>> the optimizer accordingly is more urgent than agreeing on a syntax
>> (although the Table API syntax could be inspired by Calcite's StreamSQL
>> syntax [2]). I can also help implementing the missing features in Calcite.
>> 
>> Having a reference implementation with tests would be awesome and
>> definitely help.
>> 
>> Best, Fabian
>> 
>> [1]
>> 
>> https://ci.apache.org/projects/flink/flink-docs-master/apis/batch/libs/table.html
>> [2]
>> 
>> https://docs.google.com/document/d/19kSOAOINKCSWLBCKRq2WvNtmuaA9o3AyCh2ePqr3V5E
>> 
>> 2016-02-14 21:23 GMT+01:00 Julian Hyde <jhyde@apache.org>:
>> 
>>> Fabian,
>>> 
>>> Apologies for the late reply.
>>> 
>>> I would rather that the specification for streaming SQL was not too
>>> prescriptive for how late events were handled. Approaches 1, 2 and 3 are
>>> all viable, and engines can differentiate themselves by the strength of
>>> their support for this. But for the SQL to be considered valid, I think the
>>> validator just needs to know that it can make progress.
>>> 
>>> There is a large area of functionality I’d call “quality of service”
>>> (QoS). This includes latency, reliability, at-least-once, at-most-once or
>>> in-order guarantees, as well as the late-row-handling this thread is
>>> concerned with. What the QoS metrics have in common is that they are
>>> end-to-end. To deliver a high QoS to the consumer, the producer needs to
>>> conform to a high QoS. The QoS is beyond the control of the SQL statement.
>>> (Although you can ask what a SQL statement is able to deliver, given the
>>> upstream QoS guarantees.) QoS is best managed by the whole system, and in
>>> my opinion this is the biggest reason to have a DSMS.
>>> 
>>> For this reason, I would be inclined to put QoS constraints on the stream
>>> definition, not on the query. For example, taking latency as the QoS metric
>>> of interest, you could flag the Orders stream as “at most 10 ms latency
>>> between the record’s timestamp and the wall-clock time of the server
>>> receiving the records, and any records arriving after that time are logged
>>> and discarded”, and that QoS constraint applies to both producers and
>>> consumers.
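
A minimal Python sketch of the idea in the paragraph above: the latency bound lives on the stream definition, and every record is checked against it on arrival. The class names StreamQoS and Stream, the field names, and the use of an in-memory list as the "log" are all assumptions made for illustration; none of this is Calcite API.

```python
from dataclasses import dataclass, field


@dataclass
class StreamQoS:
    # Maximum allowed gap between a record's timestamp and the server's
    # wall-clock time when the record is received.
    max_latency_ms: int


@dataclass
class Stream:
    name: str
    qos: StreamQoS
    accepted: list = field(default_factory=list)
    discarded: list = field(default_factory=list)  # stand-in for a log

    def receive(self, record_ts_ms, payload, wall_clock_ms):
        """Accept the record if it meets the stream's latency bound."""
        if wall_clock_ms - record_ts_ms > self.qos.max_latency_ms:
            # Too late: log and discard, as the constraint prescribes.
            self.discarded.append((record_ts_ms, payload))
            return False
        self.accepted.append((record_ts_ms, payload))
        return True


orders = Stream("Orders", StreamQoS(max_latency_ms=10))
orders.receive(1000, "a", wall_clock_ms=1004)   # 4 ms late: accepted
orders.receive(1000, "b", wall_clock_ms=1025)   # 25 ms late: discarded
print(orders.accepted, orders.discarded)
```

Because the bound is a property of the stream, any query reading Orders sees the same guarantee without restating it.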
>>> 
>>> Given a query Q ‘select stream * from Orders’, it is valid to ask “what is
>>> the expected latency of Q?” or tell the planner “produce an implementation
>>> of Q with a latency of no more than 15 ms, and if you cannot achieve that
>>> latency, fail”. You could even register Q in the system and tell the system
>>> to tighten up the latency of any upstream streams and the standing queries
>>> that populate them. But it’s not valid to say “execute Q with a latency of
>>> 15 ms”: the system may not be able to achieve it.
>>> 
>>> In summary: I would allow latency and late-row-handling and other QoS
>>> annotations in the query but it’s not the most natural or powerful place to
>>> put them.
>>> 
>>> Julian
>>> 
>>> 
>>>> On Feb 6, 2016, at 1:28 AM, Fabian Hueske <fhueske@gmail.com> wrote:
>>>> 
>>>> Excellent! I missed the punctuations in the todo list.
>>>> 
>>>> What kind of strategies do you have in mind to handle events that arrive
>>>> too late? I see
>>>> 1. dropping of late events
>>>> 2. computing an updated window result for each late arriving
>>>> element (implies that the window state is stored for a certain period
>>>> before it is discarded)
>>>> 3. computing a delta to the previous window result for each late arriving
>>>> element (requires window state as well, not applicable to all aggregation
>>>> types)
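
The three strategies listed above can be sketched in Python as follows. This is an illustration under stated assumptions, not any engine's implementation: windows are tumbling, the aggregate is a sum (which is delta-friendly), the watermark is simply the highest timestamp seen so far, and the function name process and the strategy labels "drop", "update" and "delta" are hypothetical.

```python
def process(events, size, strategy):
    """Sum values per tumbling window of the given size.

    events is a list of (timestamp, value) pairs in arrival order. A window
    is closed, and its result emitted, once the watermark reaches its end.
    A row that arrives for an already closed window is handled according to
    strategy: "drop", "update" or "delta".
    """
    if strategy not in ("drop", "update", "delta"):
        raise ValueError(f"unknown strategy: {strategy}")
    state = {}       # window start -> running sum, kept after closing
    closed = set()   # window starts whose first result has been emitted
    watermark = 0
    output = []
    for ts, value in events:
        start = (ts // size) * size
        if start in closed:
            if strategy == "drop":
                continue                      # 1. drop the late row
            state[start] += value             # 2 and 3 need retained state
            if strategy == "update":
                output.append(("update", start, state[start]))
            else:
                output.append(("delta", start, value))
            continue
        state[start] = state.get(start, 0) + value
        watermark = max(watermark, ts)
        for w in sorted(state):
            if w not in closed and w + size <= watermark:
                closed.add(w)
                output.append(("result", w, state[w]))
    return output


events = [(1, 10), (4, 5), (12, 1), (3, 2)]   # (3, 2) arrives late
for strategy in ("drop", "update", "delta"):
    print(strategy, process(events, size=10, strategy=strategy))
```

In this sketch the state for closed windows is never discarded; a real system would expire it after some retention period, which is the cost that strategies 2 and 3 carry.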
>>>> 
>>>> It would be nice if strategies to handle late-arrivers could be defined in
>>>> the query.
>>>> 
>>>> I think the plans of the Flink community are quite well aligned with your
>>>> ideas for SQL on Streams.
>>>> Should we start by updating / extending the Stream document on the Calcite
>>>> website to include the new window definitions (TUMBLE, HOP) and a
>>>> discussion of punctuations/watermarks/time bounds?
>>>> 
>>>> Fabian
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 2016-02-06 2:35 GMT+01:00 Julian Hyde <jhyde@apache.org>:
>>>> 
>>>>> Let me rephrase: The *majority* of the literature, of which I cited
>>>>> just one example, calls them punctuation, and a couple of recent
>>>>> papers out of Mountain View don't change that.
>>>>> 
>>>>> There are some fine distinctions between punctuation, heartbeats,
>>>>> watermarks and rowtime bounds, mostly in terms of how they are
>>>>> generated and propagated, that matter little when planning the query.
>>>>> 
>>>>> On Fri, Feb 5, 2016 at 5:18 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
>>>>>> On Fri, Feb 5, 2016 at 5:10 PM, Julian Hyde <jhyde@apache.org> wrote:
>>>>>> 
>>>>>>> Yes, watermarks, absolutely. The "to do" list has "punctuation", which
>>>>>>> is the same thing. (Actually, I prefer to call it "rowtime bound"
>>>>>>> because it feels more like a dynamic constraint than a piece of
>>>>>>> data, but the literature[1] calls them punctuation.)
>>>>>>> 
>>>>>> 
>>>>>> Some of the literature calls them punctuation, other literature [1] calls
>>>>>> them watermarks.
>>>>>> 
>>>>>> [1] http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf
>>>>> 
>>> 
>>> 
>> 
> 
> 
> 
> -- 
> Milinda Pathirage
> 
> PhD Student | Research Assistant
> School of Informatics and Computing | Data to Insight Center
> Indiana University
> 
> twitter: milindalakmal
> skype: milinda.pathirage
> blog: http://milinda.pathirage.org

