I think the difficulty with JulianF’s signal processing domain is that he needs there to
be precisely one record at every clock tick (or more generally, at every point in an Ndimensional
discrete space).
Consider stock trading. A stock trade is an event that happens in continuous time, say
(9:58:02 ORCL 41), (10:01:55 ORCL 43)
Our query wants to know the stock price at 10:00 (or at any 1minute interval). Therefore
we have to convert the eventoriented data into an array:
(9:59 ORCL 41), (10:00 ORCL 41), (10:01 ORCL 41), (10:02 ORCL 43).
JulianF’s domain may be more naturally in the realm of array databases [1] but there are
a lot of advantages to relational algebra and SQL, not least that we have reasonable story
for streaming data, so let’s try to bridge the gap. Suppose we add a FILL operator that
converts an eventbased relation into a dense array:
SELECT *
FROM TABLE(FILL(Trades, ‘ROWTIME’, INTERVAL ‘1’ MINUTE))
Now we can safely join with other data at the same granularity.
Is this a step in the right direction?
Julian
[1] https://en.wikipedia.org/wiki/Array_DBMS
> On Dec 18, 2018, at 7:05 AM, Michael Mior <mmior@apache.org> wrote:
>
> I would say a similar theory applies. Some things are different when you're
> dealing with streams. Mainly joins and aggregations. Semantics are
> necessarily different whenever you have operations involving more than one
> row at a time from the input stream. When dealing with a relation an
> aggregation is straightforward since you just consume the entire input, and
> output the result of the aggregation. Since streams don't end, you need to
> decide how this is handled which usually amounts to a choice of windowing
> algorithm. There are a few other things to think about. The presentation
> linked below from Julian Hyde has a nice overview
>
> https://www.slideshare.net/julianhyde/streamingsql62376119
>
> 
> Michael Mior
> mmior@apache.org
>
>
> Le mar. 18 déc. 2018 à 02:28, Julian Feinauer <j.feinauer@pragmaticminds.de>
> a écrit :
>
>> Hi Michael,
>>
>> yes, our workloads are usually in the context of streaming (but for replay
>> or so we also use batch).
>> But, if I understand it correctly, the same theory applies to both, tables
>> ("relations") and streaming tables, or?
>> I hope to find time soon to write a PLC4X  Calicte source which creates
>> one or many streams based on readings from a plc.
>>
>> Julian
>>
>> Am 18.12.18, 03:19 schrieb "Michael Mior" <mmior@apache.org>:
>>
>> Perhaps you've thought of this already, but it sounds like streaming
>> relational algebra could be a good fit here.
>>
>> https://calcite.apache.org/docs/stream.html
>> 
>> Michael Mior
>> mmior@apache.org
>>
>>
>> Le dim. 16 déc. 2018 à 18:39, Julian Feinauer <
>> j.feinauer@pragmaticminds.de>
>> a écrit :
>>
>>> Hi Calcitedevs,
>>>
>>> I just had a very interesting mail exchange with Julian (Hyde) on the
>>> incubator list [1]. It was about our project CRUNCH (which is mostly
>> about
>>> time series analyses and signal processing) and its relation to
>> relational
>>> algebra and I wanted to bring the discussion to this list to
>> continue here.
>>> We already had some discussion about how time series would work in
>> calcite
>>> [2] and it’s closely related to MATCH_RECOGNIZE.
>>>
>>> But, I have a more general question in mind, to ask the experts here
>> on
>>> the list.
>>> I ask myself if we can see the signal processing and analysis tasks
>> as
>>> proper application of relational algebra.
>>> Disclaimer, I’m mathematician, so I know the formals of (relational)
>>> algebra pretty well but I’m lacking a lot of experience and
>> knowledge in
>>> the database theory. Most of my knowledge there comes from Calcites
>> source
>>> code and the book from GarciaMolina and Ullman).
>>>
>>> So if we take, for example, a stream of signals from a sensor, then
>> we can
>>> of course do filtering or smoothing on it and this can be seen as a
>> mapping
>>> between the input relation and the output relation. But as we
>> usually need
>>> more than just one tuple at a time we lose many of the advantages of
>> the
>>> relational theory. And then, if we analyze the signal, we can again
>> model
>>> it as a mapping between relations, but the input relation is a “time
>>> series” and the output relation consists of “events”, so these are
>> in some
>>> way different dimensions. In this situation it becomes mostly
>> obvious where
>>> the main differences between time series and relational algebra are.
>> Think
>>> of something simple, an event should be registered, whenever the
>> signal
>>> switches from FALSE to TRUE (so not for every TRUE). This could also
>> be
>>> modelled with MATCH_RECOGNIZE pretty easily. But, for me it feels
>>> “unnatural” because we cannot use any indices (we don’t care about
>> the
>>> ratio of TRUE and FALSE in the DB, except for probably some very
>> rough
>>> outer bounds). And we are lacking the “right” information for the
>> optimizer
>>> like estimations on the number of analysis results.
>>> It gets even more complicated when moving to continuous valued
>> signals
>>> (INT, DOUBLE, …), e.g., temperature readings or something.
>>> If we want to analyze the number of times where we have a temperature
>>> change of more than 5 degrees in under 4 hours, this should also be
>> doable
>>> with MATCH_RECOGNIZE but again, there is no index to help us and we
>> have no
>>> information for the optimizer, so it feels very “black box” for the
>>> relational algebra.
>>>
>>> I’m not sure if you get my point, but for me, the elegance of
>> relational
>>> algebra was always this optimization stuff, which comes from
>> declarative
>>> and ends in an “optimal” physical plan. And I do not see how we can
>> use
>>> much of this for the examples given above.
>>>
>>> Perhaps, one solution would be to do the same as for spatial queries
>> (or
>>> the JSON / JSONB support in postgres, [3]) to add specialized
>> indices,
>>> statistics and optimizer rules. Then, this would make it more
>> “relational
>>> algebra”esque in the sense that there really is a possibility to
>> apply
>>> transformations to a given query.
>>>
>>> What do you think? Do I see things to complicated or am I missing
>>> something?
>>>
>>> Julian
>>>
>>> [1]
>>>
>> https://lists.apache.org/thread.html/1d5a5aae1d4f5f5a966438a2850860420b674f98b0db7353e7b476f2@%3Cgeneral.incubator.apache.org%3E
>>> [2]
>>>
>> https://lists.apache.org/thread.html/250575a56165851ab55351b90a26eaa30e84d5bbe2b31203daaaefb9@%3Cdev.calcite.apache.org%3E
>>> [3] https://www.postgresql.org/docs/9.4/datatypejson.html
>>>
>>>
>>
>>
>>
