calcite-dev mailing list archives

From Julian Hyde <jh...@apache.org>
Subject Re: About Stream SQL
Date Tue, 23 Feb 2016 18:25:25 GMT
Sorry I misunderstood. As for ways to tell the system that it can make
progress, the more the merrier. There's not a "best" mechanism. It
depends on the business problem. A good engine should support several,
including fully-ordered columns, punctuation, and slack, and let users
choose on a per-stream (per-topic) or even per-query basis.
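
To make the slack option concrete, here is a minimal sketch in Python. It
is not Calcite or Flink code: Row, Punctuation and with_slack are names
made up for illustration, and dropping late rows is only one possible
policy. The idea is that the ingest step promises "no row older than
(latest rowtime seen - slack) will follow" and emits that promise as a
punctuation whenever it advances.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional, Union


@dataclass(frozen=True)
class Row:
    rowtime: int  # event timestamp, e.g. milliseconds (hypothetical schema)
    payload: str


@dataclass(frozen=True)
class Punctuation:
    bound: int  # promise: no later row will have rowtime < bound


def with_slack(rows: Iterable[Row], slack: int) -> Iterator[Union[Row, Punctuation]]:
    """Interleave punctuations into a stream, tolerating `slack` units of disorder.

    Rows older than the current bound are treated as late and dropped here;
    an engine could just as well log them or revise earlier results.
    """
    bound: Optional[int] = None
    for row in rows:
        if bound is not None and row.rowtime < bound:
            continue  # late row: it would violate a bound already promised
        yield row
        candidate = row.rowtime - slack
        if bound is None or candidate > bound:
            bound = candidate  # the bound only ever moves forward
            yield Punctuation(bound)


if __name__ == "__main__":
    stream = [Row(10, "a"), Row(12, "b"), Row(11, "c"), Row(20, "d"), Row(5, "e")]
    for item in with_slack(stream, slack=3):
        print(item)
```

A downstream operator can close any window that ends at or before the
latest bound it has seen. A fully-ordered column is the special case
slack = 0.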

On Tue, Feb 23, 2016 at 9:03 AM, Milinda Pathirage
<mpathira@umail.iu.edu> wrote:
> Hi Julian,
>
> I agree with you. Calcite should stay away from physical properties of
> stream as much as possible. I was just trying to clarify the confusion
> regarding the punctuations and watermarks. My last question was not related
> to Calcite, rather to Flink and other implementations. Sorry for the
> confusion.
>
> Milinda
>
> On Tue, Feb 23, 2016 at 11:41 AM, Julian Hyde <jhyde@apache.org> wrote:
>
>> As the author of the streaming SQL specification, I don't care at all
>> how the system deduces that it is able to make progress. Just as the
>> authors of the SQL standard don't care whether a vendor chooses to
>> store records sorted and/or compressed.
>>
>> All the streaming SQL validator/optimizer needs to know is that, say,
>> orderTime is monotonic, orderId is quasi-monotonic, and paymentMethod
>> is non-monotonic, so it can allow streaming aggregations on orderTime
>> and orderId, and disallow them on paymentMethod.
>>
>> This allows streaming engines to add novel mechanisms without having
>> to change the definition of streaming SQL.
>>
>>
>> On Tue, Feb 23, 2016 at 7:42 AM, Milinda Pathirage
>> <mpathira@umail.iu.edu> wrote:
>> > Thank you Julian for the document.
>> >
>> > [1] is also a good read on punctuation. What I understood from reading
>> > [1] and the MillWheel paper is that a low-watermark (or row-time bound)
>> > is a property maintained by operators, and operators derive the
>> > low-watermark by processing punctuations.
>> >
>> > One other thing mentioned in MillWheel is the fact that Google's input
>> > streams contain punctuations to communicate stream progress. If
>> > punctuations are not there in the input stream we will have to generate
>> > them during ingest based on a slack or some similar technique. What do
>> > you think about this?
>> >
>> > Thanks
>> > Milinda
>> >
>> >
>> > [1] http://www.vldb.org/pvldb/1/1453890.pdf
>> >
>> > On Mon, Feb 22, 2016 at 9:44 PM, Julian Hyde <jhyde@apache.org> wrote:
>> >
>> >> I’ve updated the Streaming reference guide as Fabian requested:
>> >> http://calcite.apache.org/docs/stream.html
>> >>
>> >> Julian
>> >>
>> >> > On Feb 19, 2016, at 3:11 PM, Julian Hyde <jhyde@apache.org> wrote:
>> >> >
>> >> > I gave a talk about streaming SQL at a Samza meetup. A lot of it is
>> >> > about the semantics of streaming SQL, and I cover some ground that I
>> >> > don’t cover in the streams page[1].
>> >> >
>> >> > The news item[2] gets you to both slides and video.
>> >> >
>> >> > In other news, I notice[3] that Spark 2.1 will contain “continuous
>> >> > SQL”. If the examples[4] are accurate, all queries are heavily based
>> >> > on sliding windows, and they use a syntax for those windows that is
>> >> > very different to standard SQL. I think we can deal with their use
>> >> > cases, and in my opinion our proposed syntax is more elegant and
>> >> > closer to the standard. But we should discuss. I don’t want to
>> >> > diverge from other efforts because of hubris/ignorance.
>> >> >
>> >> > At the Samza meetup some folks mentioned the use case of a stream
>> >> > that summarizes, emitting periodic totals even if there were no data
>> >> > in a given period. Can they re-state that use case here, so we can
>> >> > discuss?
>> >> >
>> >> > Julian
>> >> >
>> >> > [1] http://calcite.apache.org/docs/stream.html
>> >> >
>> >> > [2] http://calcite.apache.org/news/2016/02/17/streaming-sql-talk/
>> >> >
>> >> > [3] http://www.slideshare.net/databricks/the-future-of-realtime-in-spark-58433411
>> >> > slide 29
>> >> >
>> >> > [4] https://issues.apache.org/jira/secure/attachment/12775265/StreamingDataFrameProposal.pdf
>> >> >
>> >> >> On Feb 17, 2016, at 12:09 PM, Milinda Pathirage <mpathira@umail.iu.edu> wrote:
>> >> >>
>> >> >> Hi Fabian,
>> >> >>
>> >> >> We did some work on stream joins [1]. I tested stream-to-relation
>> >> >> joins with Samza, but not stream-to-stream joins, and never updated
>> >> >> the streaming documentation. I'll send a pull request with some
>> >> >> documentation on joins.
>> >> >>
>> >> >> Thanks
>> >> >> Milinda
>> >> >>
>> >> >> [1] https://issues.apache.org/jira/browse/CALCITE-968
>> >> >>
>> >> >> On Wed, Feb 17, 2016 at 5:09 AM, Fabian Hueske <fhueske@gmail.com> wrote:
>> >> >>
>> >> >>> Hi,
>> >> >>>
>> >> >>> I agree, the Streaming page is a very good starting point for this
>> >> >>> discussion. As suggested by Julian, I created CALCITE-1090 to update
>> >> >>> the page such that it reflects the current state of the discussion
>> >> >>> (adding HOP and TUMBLE functions, punctuations). I can also help with
>> >> >>> that, e.g., by contributing figures, examples, or text, reviewing, or
>> >> >>> any other way.
>> >> >>>
>> >> >>> From my point of view, the semantics of the window types and the
>> >> >>> other operators in the Streaming document are very good. What is
>> >> >>> missing are joins (windowed stream-stream joins, stream-table joins,
>> >> >>> stream-table joins with table updates) as already noted in the todo
>> >> >>> section.
>> >> >>>
>> >> >>> Regarding the handling of late-arriving events, I am not sure if this
>> >> >>> is a purely QoS issue as the result of a query might depend on the
>> >> >>> chosen strategy. Also users might want to pick different strategies
>> >> >>> for different operations, so late-arriver strategies are not
>> >> >>> necessarily end-to-end but can be operator specific. However, I think
>> >> >>> these details should be discussed in a separate thread.
>> >> >>>
>> >> >>> I'd like to add a few words about the StreamSQL roadmap of the Flink
>> >> >>> community.
>> >> >>> We are currently preparing our codebase and will start to work on
>> >> >>> support for structured queries on streams in the next weeks. Flink
>> >> >>> will support two query interfaces, a SQL interface and a LINQ-style
>> >> >>> Table API [1]. Both will be optimized and translated by Calcite. As a
>> >> >>> first step, we want to add simple stream transformations such as
>> >> >>> selection and projection to both interfaces. Next, we will add
>> >> >>> windowing support (starting with tumbling and hopping windows) to the
>> >> >>> Table API (as I said before, our plans here are well aligned with
>> >> >>> Julian's suggestions). Once this is done, we would extend the SQL
>> >> >>> interface to support windows, which is hopefully as simple as using a
>> >> >>> parser that accepts window syntax.
>> >> >>>
>> >> >>> So from our point of view, fixing the semantics of windows and
>> >> >>> extending the optimizer accordingly is more urgent than agreeing on a
>> >> >>> syntax (although the Table API syntax could be inspired by Calcite's
>> >> >>> StreamSQL syntax [2]). I can also help implementing the missing
>> >> >>> features in Calcite.
>> >> >>>
>> >> >>> Having a reference implementation with tests would be awesome and
>> >> >>> definitely help.
>> >> >>>
>> >> >>> Best, Fabian
>> >> >>>
>> >> >>> [1] https://ci.apache.org/projects/flink/flink-docs-master/apis/batch/libs/table.html
>> >> >>> [2] https://docs.google.com/document/d/19kSOAOINKCSWLBCKRq2WvNtmuaA9o3AyCh2ePqr3V5E
>> >> >>>
>> >> >>> 2016-02-14 21:23 GMT+01:00 Julian Hyde <jhyde@apache.org>:
>> >> >>>
>> >> >>>> Fabian,
>> >> >>>>
>> >> >>>> Apologies for the late reply.
>> >> >>>>
>> >> >>>> I would rather that the specification for streaming SQL was not too
>> >> >>>> prescriptive for how late events were handled. Approaches 1, 2 and 3
>> >> >>>> are all viable, and engines can differentiate themselves by the
>> >> >>>> strength of their support for this. But for the SQL to be considered
>> >> >>>> valid, I think the validator just needs to know that it can make
>> >> >>>> progress.
>> >> >>>>
>> >> >>>> There is a large area of functionality I’d call “quality of service”
>> >> >>>> (QoS). This includes latency, reliability, at-least-once,
>> >> >>>> at-most-once or in-order guarantees, as well as the late-row-handling
>> >> >>>> this thread is concerned with. What the QoS metrics have in common is
>> >> >>>> that they are end-to-end. To deliver a high QoS to the consumer, the
>> >> >>>> producer needs to conform to a high QoS. The QoS is beyond the
>> >> >>>> control of the SQL statement. (Although you can ask what a SQL
>> >> >>>> statement is able to deliver, given the upstream QoS guarantees.) QoS
>> >> >>>> is best managed by the whole system, and in my opinion this is the
>> >> >>>> biggest reason to have a DSMS.
>> >> >>>>
>> >> >>>> For this reason, I would be inclined to put QoS constraints on the
>> >> >>>> stream definition, not on the query. For example, taking latency as
>> >> >>>> the QoS metric of interest, you could flag the Orders stream as “at
>> >> >>>> most 10 ms latency between the record’s timestamp and the wall-clock
>> >> >>>> time of the server receiving the records, and any records arriving
>> >> >>>> after that time are logged and discarded”, and that QoS constraint
>> >> >>>> applies to both producers and consumers.
>> >> >>>>
>> >> >>>> Given a query Q ‘select stream * from Orders’, it is valid to ask
>> >> >>>> “what is the expected latency of Q?” or tell the planner “produce an
>> >> >>>> implementation of Q with a latency of no more than 15 ms, and if you
>> >> >>>> cannot achieve that latency, fail”. You could even register Q in the
>> >> >>>> system and tell the system to tighten up the latency of any upstream
>> >> >>>> streams and the standing queries that populate them. But it’s not
>> >> >>>> valid to say “execute Q with a latency of 15 ms”: the system may not
>> >> >>>> be able to achieve it.
>> >> >>>>
>> >> >>>> In summary: I would allow latency and late-row-handling and other QoS
>> >> >>>> annotations in the query but it’s not the most natural or powerful
>> >> >>>> place to put them.
>> >> >>>>
>> >> >>>> Julian
>> >> >>>>
>> >> >>>>
>> >> >>>>> On Feb 6, 2016, at 1:28 AM, Fabian Hueske <fhueske@gmail.com> wrote:
>> >> >>>>>
>> >> >>>>> Excellent! I missed the punctuations in the todo list.
>> >> >>>>>
>> >> >>>>> What kind of strategies do you have in mind to handle events that
>> >> >>>>> arrive too late? I see
>> >> >>>>> 1. dropping of late events
>> >> >>>>> 2. computing an updated window result for each late arriving
>> >> >>>>>    element (implies that the window state is stored for a certain
>> >> >>>>>    period before it is discarded)
>> >> >>>>> 3. computing a delta to the previous window result for each late
>> >> >>>>>    arriving element (requires window state as well, not applicable
>> >> >>>>>    to all aggregation types)
>> >> >>>>>
>> >> >>>>> It would be nice if strategies to handle late-arrivers could be
>> >> >>>>> defined in the query.
>> >> >>>>>
>> >> >>>>> I think the plans of the Flink community are quite well aligned with
>> >> >>>>> your ideas for SQL on Streams.
>> >> >>>>> Should we start by updating / extending the Stream document on the
>> >> >>>>> Calcite website to include the new window definitions (TUMBLE, HOP)
>> >> >>>>> and a discussion of punctuations/watermarks/time bounds?
>> >> >>>>>
>> >> >>>>> Fabian
>> >> >>>>>
>> >> >>>>>
>> >> >>>>>
>> >> >>>>>
>> >> >>>>>
>> >> >>>>>
>> >> >>>>> 2016-02-06 2:35 GMT+01:00 Julian Hyde <jhyde@apache.org>:
>> >> >>>>>
>> >> >>>>>> Let me rephrase: The *majority* of the literature, of which I cited
>> >> >>>>>> just one example, calls them punctuation, and a couple of recent
>> >> >>>>>> papers out of Mountain View don't change that.
>> >> >>>>>>
>> >> >>>>>> There are some fine distinctions between punctuation, heartbeats,
>> >> >>>>>> watermarks and rowtime bounds, mostly in terms of how they are
>> >> >>>>>> generated and propagated, that matter little when planning the
>> >> >>>>>> query.
>> >> >>>>>>
>> >> >>>>>> On Fri, Feb 5, 2016 at 5:18 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
>> >> >>>>>>> On Fri, Feb 5, 2016 at 5:10 PM, Julian Hyde <jhyde@apache.org> wrote:
>> >> >>>>>>>
>> >> >>>>>>>> Yes, watermarks, absolutely. The "to do" list has "punctuation",
>> >> >>>>>>>> which is the same thing. (Actually, I prefer to call it "rowtime
>> >> >>>>>>>> bound" because it feels more like a dynamic constraint than a
>> >> >>>>>>>> piece of data, but the literature[1] calls them punctuation.)
>> >> >>>>>>>>
>> >> >>>>>>>
>> >> >>>>>>> Some of the literature calls them punctuation, other literature [1]
>> >> >>>>>>> calls them watermarks.
>> >> >>>>>>>
>> >> >>>>>>> [1] http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf
>> >> >>>>>>
>> >> >>>>
>> >> >>>>
>> >> >>>
>> >> >>
>> >> >>
>> >> >>
>> >> >> --
>> >> >> Milinda Pathirage
>> >> >>
>> >> >> PhD Student | Research Assistant
>> >> >> School of Informatics and Computing | Data to Insight Center
>> >> >> Indiana University
>> >> >>
>> >> >> twitter: milindalakmal
>> >> >> skype: milinda.pathirage
>> >> >> blog: http://milinda.pathirage.org
>> >> >
>> >>
>> >>
>> >
>> >
>> > --
>> > Milinda Pathirage
>> >
>> > PhD Student | Research Assistant
>> > School of Informatics and Computing | Data to Insight Center
>> > Indiana University
>> >
>> > twitter: milindalakmal
>> > skype: milinda.pathirage
>> > blog: http://milinda.pathirage.org
>>
>
>
>
> --
> Milinda Pathirage
>
> PhD Student | Research Assistant
> School of Informatics and Computing | Data to Insight Center
> Indiana University
>
> twitter: milindalakmal
> skype: milinda.pathirage
> blog: http://milinda.pathirage.org
