calcite-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Julian Hyde <jh...@apache.org>
Subject Re: MATCH_RECOGNIZE
Date Fri, 28 Dec 2018 01:10:33 GMT
I think we should get one example working end-to-end before moving to the next. (By “example”
I mean a SQL query on a standard data set that exercises one new feature, say FINAL.) Right
now nothing works end-to-end because we don’t have the basic code generation working.

I agree that assigning symbols to matched rows is a hard problem. I think the best approach
is to first figure out whether there is a match (that’s what Automaton does) and then, in
a second pass, assign a symbol to each row. The second pass might be significantly slower,
but only occurs less often. Also, AFAICT, symbol assignment is only required for the CLASSIFIER()
function. So I was going to defer that task.

Yes, I am actively working on this code. I see that your branch has significant changes because
you use a different coding style (e.g. different indentation). Please change your code back
to the existing style. There is no reason to make the task even more difficult than it already
is.

Julian


> On Dec 27, 2018, at 1:19 PM, Julian Feinauer <j.feinauer@pragmaticminds.de> wrote:
> 
> Hi Julian,
> 
> regarding "^" and "$" it seems like Zhiqiang already introduced the fields strictStart
and strictEnd in org.apache.calcite.rel.core.Match. But I agree with you and already had the
same idea. And I'll go over to you last commit to start my branch off.
> 
> I made some progress in my branch [1]. I get it to compile and I get the test `JdbcTest.testMatch`
to run and to fail (but no longer throw an exception, at least).
> I fixed several things at several places and I think the code generation is now working
(NOT working good) for Matcher and Emitter.
> But there are “crucial” points where I’d like to have your advice (or someone else
familiar with these topics):
> 
> First, I’m unsure how the FINAL function should be implemented (it’s no regular operator
and I did not find any reference on how to deal with this) so I “shortcutet” it by a reference
to the ABS function which is “noop” in the test case, see RexImplTable:385.
> 
> I also have no real idea about the implementation of PREV / LAST Operators. I think there
are some similarities to Window Aggregates and the PRECEDING / FOLLOWING operators, like RexWindowBound.
> 
> But currently I started working on a refactoring of the Matcher. Currently it only returns
the rows that matched but not the respective symbols the rows where matched to. They are necessary
for the emitter. I'm unsure whether to keep it based on an NFA or it is easier with a DFA.

> 
> Before I continue and dig more through the code base it would be good for me to have
some kind of feedback whether I’m going in the right direction and the things I do are of
any value or if I misunderstood or misinterpreted some parts.
> 
> JulianF
> 
> PS.: Are you actively working on the branch? We should synchronize to avoid duplicate
work.
> 
> [1] https://github.com/JulianFeinauer/calcite/tree/1935-match-recognize
> 
> Am 27.12.18, 21:21 schrieb "Julian Hyde" <jhyde@apache.org>:
> 
>    I think you can implement “^” by adding a special BEGIN state to the automaton.
Each automaton should be in this state on creation, and there is no inbound transition (i.e.
no way to get back into this state).
> 
>    And you can implement “$” by adding a special end-of-data symbol (you might as
well call it “$”) that is sent to each partition’s automaton when the input ends.
> 
>    These seem to be elegant solutions because most of the work is in Pattern and Automaton,
and can be unit-tested in AutomatonTest. Just a little extra plumbing needs to be added to
the runtime in order to use it.
> 
>    As you have noticed my branch https://github.com/julianhyde/calcite/tree/1935-match-recognize/
<https://github.com/julianhyde/calcite/tree/1935-match-recognize/> is broken as of the
latest commit. Consider starting your branch from the previous commit https://github.com/julianhyde/calcite/commit/ea20e84c2d0cf636d2279d182be6df2ef65b67d7
<https://github.com/julianhyde/calcite/commit/ea20e84c2d0cf636d2279d182be6df2ef65b67d7>.
We can sync up when my branch is working.
> 
>    Julian
> 
> 
>> On Dec 26, 2018, at 6:44 AM, Julian Feinauer <j.feinauer@pragmaticminds.de>
wrote:
>> 
>> Hi Julian,
>> 
>> I used [1] as reference. Anchors are explicitly stated as part of the syntax and
explained as:
>> 
>>> Anchors work in terms of positions rather than rows. They match a position either
at the start or end of a partition.
>>>   ^ matches the position before the first row in the partition.
>>>   $ matches the position after the last row in the partition.
>>> As an example, PATTERN (^A+$) will match only if all rows in a partition satisfy
the condition for A. The resulting match spans the entire partition.
>> 
>> Regarding patterns, I think it should not be a big change, as the anchors are defined
with respect to partition boundaries. So technically they do not have to see "beyond" boundaries
but should simply "see" boundaries.
>> So all we need should be an "outside partition" state which CAN be used as starting
or ending state (basically symbols "^" and "$" should reference that).
>> 
>> I'll see if I find a solution based on your code... I'll do the work in my branch
[2] based on your branch [3].
>> 
>> Best
>> JulianF
>> 
>> [1] https://docs.oracle.com/database/121/DWHSG/pattern.htm#DWHSG8956
>> [2] https://github.com/JulianFeinauer/calcite/tree/1935-match-recognize
>> [3] https://github.com/julianhyde/calcite/tree/1935-match-recognize
>> 
>> Am 26.12.18, 08:49 schrieb "Julian Hyde" <jhyde@apache.org>:
>> 
>>   You are correct that my 1935-match-recognize branch doesn’t compile (as of 1a552a9).
I committed and pushed in the middle of a change because I had done a non-trivial rebase.
>> 
>>   I haven’t missed a file; the two compilation errors were intended to remind me
where to start work again. I am working on generating code to emit rows, and to populate measures
and predicates from the input row. If you can make progress on that, that would be awesome.
>> 
>>   Are anchors (“^” and “$”) supported by Oracle? If so can you point me to
the spec/examples. I am surprised that anything to do with patterns needs see beyond the boundaries
of the current partition. I had assumed that each partition has its own state machine and
it will be difficult to change that.
>> 
>>   Julian
>> 
>> 
>>> On Dec 25, 2018, at 2:56 PM, Julian Feinauer <j.feinauer@pragmaticminds.de>
wrote:
>>> 
>>> Hey,
>>> 
>>> it's once again me, JulianF.
>>> I started work on the Automaton / Matcher and implemented OR and OPTIONAL ("?")
to get started with the code.
>>> I would highly appreciate if you (Julian H) could check this code (I made a PR
to your branch).
>>> Then, what else did you consider as necessary for the implementation?
>>> I thought about anchors ("^", "$") but this would need a little bit of extra
changes in the PartitionStates, as far as I see it (to check when we "enter" a partition and
when we "leave".
>>> 
>>> Best
>>> JulianF
>>> 
>>> Am 25.12.18, 20:38 schrieb "Julian Feinauer" <j.feinauer@pragmaticminds.de>:
>>> 
>>>  Hi Julian,
>>> 
>>>  as I already declared my interest in MATCH_RECOGNIZE and offered my help, I
plan to do some things in the next one or two weeks.
>>>  Thus, I wanted to start based on your branch (“1935-match-recognize”).
>>> 
>>>  I have some problems getting it to run.
>>>  Is it possible that there are some files missing in the commit or are there
some things to consider?
>>> 
>>>  Thanks!
>>>  Julian (F)
>>> 
>>>  On 2018/11/26 20:09:00, Julian Hyde <j...@apache.org<mailto:j...@apache.org>>
wrote:
>>>> Over thanksgiving, I started working on MATCH_RECOGNIZE again. I wrote a
standalone class called Automaton that allows you to build patterns (basically regular expressions,
but sufficient for the PATTERN sub-clause of MATCH_RECOGNIZE), and execute them in a unit
test.>
>>>> 
>>>> Would someone like to help me develop this? We have support for “*” (zero
or more repeats), “+” (1 or more repeats) and “{m,n}” (bounded repeats) but need “|”
(or) and several others. It should be fairly straightforward test-driven development: add
tests to AutomatonTest.java [1], then change Automaton, AutomatonBuilder, Pattern or Matcher
until they pass.>
>>>> 
>>>> We also need lots of SQL tests. Could someone write queries against Oracle’s
“ticker” table and paste the queries and results into match.iq?>
>>>> 
>>>> See CALCITE-1935 [2], and my branch [3].>
>>>> 
>>>> I have cherry-picked commits from Zhiqiang He’s branch [4] into my branch,
so this will be a joint effort when it is finished.>
>>>> 
>>>> Julian>
>>>> 
>>>> [1] https://github.com/julianhyde/calcite/blob/1935-match-recognize/core/src/test/java/org/apache/calcite/runtime/AutomatonTest.java
<https://github.com/julianhyde/calcite/blob/1935-match-recognize/core/src/test/java/org/apache/calcite/runtime/AutomatonTest.java><https://github.com/julianhyde/calcite/blob/1935-match-recognize/core/src/test/java/org/apache/calcite/runtime/AutomatonTest.java%3e>>
>>>> 
>>>> [2] https://issues.apache.org/jira/browse/CALCITE-1935 <https://issues.apache.org/jira/browse/CALCITE-1935><https://issues.apache.org/jira/browse/CALCITE-1935%3e>>
>>>> 
>>>> [3] https://github.com/julianhyde/calcite/tree/1935-match-recognize/ <https://github.com/julianhyde/calcite/tree/1935-match-recognize/><https://github.com/julianhyde/calcite/tree/1935-match-recognize/%3e>>
>>>> 
>>>> [4] https://github.com/Zhiqiang-He/calcite/tree/calcite-1935-MR-Implementation3
<https://github.com/Zhiqiang-He/calcite/tree/calcite-1935-MR-Implementation3><https://github.com/Zhiqiang-He/calcite/tree/calcite-1935-MR-Implementation3%3e>>
>>>> 
>>>> 
>>>>> On Nov 21, 2018, at 8:45 AM, Julian Feinauer <j....@pragmaticminds.de<mailto:j....@pragmaticminds.de>>
wrote:>
>>>>>> 
>>>>> Sorry, this is an old mail which got sent accidentally again by my mail
program.>
>>>>> Please ignore this and excuse this.>
>>>>>> 
>>>>> Julian>
>>>>>> 
>>>>> Am 21.11.18, 16:34 schrieb "Julian Feinauer" <j....@pragmaticminds.de<mailto:j....@pragmaticminds.de>>:>
>>>>>> 
>>>>> Hi Julian,>
>>>>>> 
>>>>> I decided to reply to this (old) email, because here some facts are noted.>
>>>>> Funnily, Apache Flink released their MATCH_RECOGNIZE Implementation yesterday.>
>>>>>> 
>>>>> So I recall that you and Zhigiang He did something on this.>
>>>>> I would like to have such a feature in Calcite (as stated in the other
mail) and could try to go into this a bit with a colleague of mine and give a bit of support
on this topic (In fact, it sounds like fun to us…).>
>>>>> Perhaps theres also the chance to learn something from Flinks implementation,
as you already had some contacts with them, I think?>
>>>>>> 
>>>>> Best>
>>>>> Julian>
>>>>>> 
>>>>> On 2018/07/23 17:53:57, Julian Hyde <j....@apache.org<mailto:j....@apache.org>>
wrote:>
>>>>>> For quite a while we have had partial support for MATCH_RECOGNIZE.
We support it in the parser and validator, but there is no runtime implementation. It’s
a shame, because MATCH_RECOGNIZE is an incredibly powerful SQL feature for both traditional
SQL (it’s in Oracle 12c) and for continuous query (aka complex event processing - CEP).>>
>>>>>>> 
>>>>>> I figure it’s time to change that. My plan is to implement it incrementally,
getting simple queries working to start with, then allow people to add more complex queries.>>
>>>>>>> 
>>>>>> In a dev branch [1], I’ve added a method Enumerables.match[2].
The idea is that if you supply an Enumerable of input data, a finite state machine to figure
out when a sequence of rows makes a match (represented by a transition function: (state, row)
-> state), and a function to convert a matched set of rows to a set of output rows. The
match method is fairly straightforward, and I almost have it finished.>>
>>>>>>> 
>>>>>> The complexity is in generating the finite state machine, emitter
function, and so forth.>>
>>>>>>> 
>>>>>> Can someone help me with this task? If your idea of fun is implementing
database algorithms, this is about as much fun as it gets. You learned about finite state
machines in college - this is your chance to actually write one!>>
>>>>>>> 
>>>>>> This might be a good joint project with the Flink community. I know
Flink are thinking of implementing CEP, and the algorithm we write here could be shared with
Flink (for use via Flink SQL or via the Flink API).>>
>>>>>>> 
>>>>>> Julian>>
>>>>>>> 
>>>>>> [1] https://github.com/julianhyde/calcite/commits/1935-match-recognize
<https://github.com/julianhyde/calcite/commits/1935-match-recognize>><https://github.com/julianhyde/calcite/commits/1935-match-recognize%3e%3e>>
>>>>>>> 
>>>>>> [2] https://github.com/julianhyde/calcite/commit/4dfaf1bbee718aa6694a8ce67d829c32d04c7e87#diff-8a97a64204db631471c563df7551f408R73
<https://github.com/julianhyde/calcite/commit/4dfaf1bbee718aa6694a8ce67d829c32d04c7e87#diff-8a97a64204db631471c563df7551f408R73>>>
>>>>>> 
>>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
> 
> 
> 


Mime
View raw message