calcite-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Julian Hyde <jh...@apache.org>
Subject Re: About Stream SQL
Date Tue, 23 Feb 2016 02:44:54 GMT
I’ve updated the Streaming reference guide as Fabian requested: http://calcite.apache.org/docs/stream.html
<http://calcite.apache.org/docs/stream.html>

Julian

> On Feb 19, 2016, at 3:11 PM, Julian Hyde <jhyde@apache.org> wrote:
> 
> I gave a talk about streaming SQL at a Samza meetup. A lot of it is about the semantics
of streaming SQL, and I cover some ground that I don’t cover in the streams page[1].
> 
> The news item[2] gets you to both slides and video.
> 
> In other news, I notice[3] that Spark 2.1 will contain “continuous SQL”. If the examples[4]
are accurate, all queries are heavily based on sliding windows, and they use a syntax for
those windows that is very different to standard SQL.  I think we can deal with their use
cases, and in my opinion our proposed syntax is more elegant and closer to the standard. But
we should discuss. I don’t want to diverge from other efforts because of hubris/ignorance.
> 
> At the Samza meetup some folks mentioned the use case of a stream that summarizes, emitting
periodic totals even if there were no data in a given period. Can they re-state that use case
here, so we can discuss?
> 
> Julian
> 
> [1] http://calcite.apache.org/docs/stream.html
> 
> [2] http://calcite.apache.org/news/2016/02/17/streaming-sql-talk/
> 
> [3] http://www.slideshare.net/databricks/the-future-of-realtime-in-spark-58433411 slide
29
> 
> [4] https://issues.apache.org/jira/secure/attachment/12775265/StreamingDataFrameProposal.pdf

> 
>> On Feb 17, 2016, at 12:09 PM, Milinda Pathirage <mpathira@umail.iu.edu> wrote:
>> 
>> Hi Fabian,
>> 
>> We did some work on stream joins [1]. I tested stream-to-relation joins
>> with Samza. But not stream-to-stream joins. But never updated the streaming
>> documentation. I'll send a pull request with some documentation on joins.
>> 
>> Thanks
>> Milinda
>> 
>> [1] https://issues.apache.org/jira/browse/CALCITE-968
>> 
>> On Wed, Feb 17, 2016 at 5:09 AM, Fabian Hueske <fhueske@gmail.com> wrote:
>> 
>>> Hi,
>>> 
>>> I agree, the Streaming page is a very good starting point for this
>>> discussion. As suggested by Julian, I created CALCITE-1090 to update the
>>> page such that it reflects the current state of the discussion (adding HOP
>>> and TUMBLE functions, punctuations). I can also help with that, e.g., by
>>> contributing figures, examples, or text, reviewing, or any other way.
>>> 
>>> From my point of view, the semantics of the window types and the other
>>> operators in the Streaming document are very good. What is missing are
>>> joins (windowed stream-stream joins, stream-table joins, stream-table joins
>>> with table updates) as already noted in the todo section.
>>> 
>>> Regarding the handling of late-arriving events, I am not sure if this is a
>>> purely QoS issue as the result of a query might depend on the chosen
>>> strategy. Also users might want to pick different strategies for different
>>> operations, so late-arriver strategies are not necessarily end-to-end but
>>> can be operator specific. However, I think these details should be
>>> discussed in a separate thread.
>>> 
>>> I'd like to add a few words about the StreamSQL roadmap of the Flink
>>> community.
>>> We are currently preparing our codebase and will start to work on support
>>> for structured queries on streams in the next weeks. Flink will support two
>>> query interface, a SQL interface and a LINQ-style Table API [1]. Both will
>>> be optimized and translated by Calcite. As a first step, we want to add
>>> simple stream transformations such as selection and projection to both
>>> interfaces. Next, we will add windowing support (starting with tumbling and
>>> hopping windows) to the Table API (as is said before, our plans here are
>>> well aligned with Julian's suggestions). Once this is done, we would extend
>>> the SQL interface to support windows which is hopefully as simple as using
>>> a parser that accepts window syntax.
>>> 
>>> So from our point of view, fixing the semantics of windows and extending
>>> the optimizer accordingly is more urgent than agreeing on a syntax
>>> (although the Table API syntax could be inspired by Calcite's StreamSQL
>>> syntax [2]). I can also help implementing the missing features in Calcite.
>>> 
>>> Having a reference implementation with tests would be awesome and
>>> definitely help.
>>> 
>>> Best, Fabian
>>> 
>>> [1]
>>> 
>>> https://ci.apache.org/projects/flink/flink-docs-master/apis/batch/libs/table.html
>>> [2]
>>> 
>>> https://docs.google.com/document/d/19kSOAOINKCSWLBCKRq2WvNtmuaA9o3AyCh2ePqr3V5E
>>> 
>>> 2016-02-14 21:23 GMT+01:00 Julian Hyde <jhyde@apache.org>:
>>> 
>>>> Fabian,
>>>> 
>>>> Apologies for the late reply.
>>>> 
>>>> I would rather that the specification for streaming SQL was not too
>>>> prescriptive for how late events were handled. Approaches 1, 2 and 3 are
>>>> all viable, and engines can differentiate themselves by the strength of
>>>> their support for this. But for the SQL to be considered valid, I think
>>> the
>>>> validator just needs to know that it can make progress.
>>>> 
>>>> There is a large area of functionality I’d call “quality of service”
>>>> (QoS). This includes latency, reliability, at-least-once, at-most-once or
>>>> in-order guarantees, as well as the late-row-handling this thread is
>>>> concerned with. What the QoS metrics have in common is that they are
>>>> end-to-end. To deliver a high QoS to the consumer, the producer needs to
>>>> conform to a high QoS. The QoS is beyond the control of the SQL
>>> statement.
>>>> (Although you can ask what a SQL statement is able to deliver, given the
>>>> upstream QoS guarantees.) QoS is best managed by the whole system, and in
>>>> my opinion this is the biggest reason to have a DSMS.
>>>> 
>>>> For this reason, I would be inclined to put QoS constraints on the stream
>>>> definition, not on the query. For example, taking latency as the QoS
>>> metric
>>>> of interest, you could flag the Orders stream as “at most 10 ms latency
>>>> between the record’s timestamp and the wall-clock time of the server
>>>> receiving the records, and any records arriving after that time are
>>> logged
>>>> and discarded”, and that QoS constraint applies to both producers and
>>>> consumers.
>>>> 
>>>> Given a query Q ‘select stream * from Orders’, it is valid to ask “what
>>> is
>>>> the expected latency of Q?” or tell the planner “produce an
>>> implementation
>>>> of Q with a latency of no more than 15 ms, and if you cannot achieve that
>>>> latency, fail”. You could even register Q in the system and tell the
>>> system
>>>> to tighten up the latency of any upstream streams and the standing
>>> queries
>>>> that populate them. But it’s not valid to say “execute Q with a latency
>>> of
>>>> 15 ms”: the system may not be able to achieve it.
>>>> 
>>>> In summary: I would allow latency and late-row-handling and other QoS
>>>> annotations in the query but it’s not the most natural or powerful place
>>> to
>>>> put them.
>>>> 
>>>> Julian
>>>> 
>>>> 
>>>>> On Feb 6, 2016, at 1:28 AM, Fabian Hueske <fhueske@gmail.com> wrote:
>>>>> 
>>>>> Excellent! I missed the punctuations in the todo list.
>>>>> 
>>>>> What kind of strategies do you have in mind to handle events that
>>> arrive
>>>>> too late? I see
>>>>> 1. dropping of late events
>>>>> 2. computing an updated window result for each late arriving
>>>>> element (implies that the window state is stored for a certain period
>>>>> before it is discarded)
>>>>> 3. computing a delta to the previous window result for each late
>>> arriving
>>>>> element (requires window state as well, not applicable to all
>>> aggregation
>>>>> types)
>>>>> 
>>>>> It would be nice if strategies to handle late-arrivers could be defined
>>>> in
>>>>> the query.
>>>>> 
>>>>> I think the plans of the Flink community are quite well aligned with
>>> your
>>>>> ideas for SQL on Streams.
>>>>> Should we start by updating / extending the Stream document on the
>>>> Calcite
>>>>> website to include the new window definitions (TUMBLE, HOP) and a
>>>>> discussion of punctuations/watermarks/time bounds?
>>>>> 
>>>>> Fabian
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 2016-02-06 2:35 GMT+01:00 Julian Hyde <jhyde@apache.org>:
>>>>> 
>>>>>> Let me rephrase: The *majority* of the literature, of which I cited
>>>>>> just one example, calls them punctuation, and a couple of recent
>>>>>> papers out of Mountain View doesn't change that.
>>>>>> 
>>>>>> There are some fine distinctions between punctuation, heartbeats,
>>>>>> watermarks and rowtime bounds, mostly in terms of how they are
>>>>>> generated and propagated, that matter little when planning the query.
>>>>>> 
>>>>>> On Fri, Feb 5, 2016 at 5:18 PM, Ted Dunning <ted.dunning@gmail.com>
>>>> wrote:
>>>>>>> On Fri, Feb 5, 2016 at 5:10 PM, Julian Hyde <jhyde@apache.org>
>>> wrote:
>>>>>>> 
>>>>>>>> Yes, watermarks, absolutely. The "to do" list has "punctuation",
>>> which
>>>>>>>> is the same thing. (Actually, I prefer to call it "rowtime
bound"
>>>>>>>> because it is feels more like a dynamic constraint than a
piece of
>>>>>>>> data, but the literature[1] calls them punctuation.)
>>>>>>>> 
>>>>>>> 
>>>>>>> Some of the literature calls them punctuation, other literature
[1]
>>>> calls
>>>>>>> them watermarks.
>>>>>>> 
>>>>>>> [1] http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf
>>>>>> 
>>>> 
>>>> 
>>> 
>> 
>> 
>> 
>> -- 
>> Milinda Pathirage
>> 
>> PhD Student | Research Assistant
>> School of Informatics and Computing | Data to Insight Center
>> Indiana University
>> 
>> twitter: milindalakmal
>> skype: milinda.pathirage
>> blog: http://milinda.pathirage.org
> 


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message