spark-dev mailing list archives

From Henri Dubois-Ferriere <henr...@gmail.com>
Subject Re: SPARK-SQL: Pattern Detection on Live Event or Archived Event Data
Date Tue, 01 Mar 2016 16:34:56 GMT
FWIW, Apache Flink just added CEP support. Queries are constructed
programmatically rather than in SQL, but the underlying functionality is
similar.

https://issues.apache.org/jira/browse/FLINK-3215

On 1 March 2016 at 08:19, Jerry Lam <chilinglam@gmail.com> wrote:

> Hi Herman,
>
> Thank you for your reply!
> This functionality usually finds its place in financial services which use
> CEP (complex event processing) for correlation and pattern matching. Many
> commercial products have this, including Oracle and Teradata Aster Data MR
> Analytics. I do agree the syntax is a bit awkward, but once you understand
> it, it is actually very compact for expressing something very complex.
> Esper has this feature partially implemented (
> http://www.espertech.com/esper/release-5.1.0/esper-reference/html/match-recognize.html
> ).
>
> I found that the Teradata Analytics documentation describes its usage best.
> For example (note that nPath is similar to match_recognize):
>
> SELECT last_pageid, MAX( count_page80 )
>  FROM nPath(
>  ON ( SELECT * FROM clicks WHERE category >= 0 )
>  PARTITION BY sessionid
>  ORDER BY ts
>  PATTERN ( 'A.(B|C)*' )
>  MODE ( OVERLAPPING )
>  SYMBOLS ( pageid = 50 AS A,
>            pageid = 80 AS B,
>            pageid <> 80 AND category IN (9,10) AS C )
>  RESULT ( LAST ( pageid OF ANY ( A,B,C ) ) AS last_pageid,
>           COUNT ( * OF B ) AS count_page80,
>           COUNT ( * OF ANY ( A,B,C ) ) AS count_any )
>  )
>  WHERE count_any >= 5
>  GROUP BY last_pageid
>  ORDER BY MAX( count_page80 )
>
> The above means:
> Find user click-paths starting at pageid 50 and passing exclusively
> through either pageid 80 or pages in category 9 or category 10. Find the
> pageid of the last page in the path and count the number of times page 80
> was visited. Report the maximum count for each last page, and sort the
> output by the latter. Restrict to paths containing at least 5 pages. Ignore
> pages in the sequence with category < 0.
>
> If this query were written in pure SQL (if that is possible at all), it
> would require several self-joins. The interesting thing about this feature
> is that it integrates SQL+Streaming+ML in one (and perhaps graph too).
>
> Best Regards,
>
> Jerry
>
>
> On Tue, Mar 1, 2016 at 9:39 AM, Herman van Hövell tot Westerflier <
> hvanhovell@questtec.nl> wrote:
>
>> Hi Jerry,
>>
>> This is not on any roadmap. I (briefly) browsed through this, and it
>> looks like some sort of window function with very awkward syntax. I think
>> Spark provides better constructs for this using DataFrames/Datasets/nested
>> data...
>>
>> Feel free to submit a PR.
>>
>> Kind regards,
>>
>> Herman van Hövell
>>
>> 2016-03-01 15:16 GMT+01:00 Jerry Lam <chilinglam@gmail.com>:
>>
>>> Hi Spark developers,
>>>
>>> Would you consider adding support for "Pattern matching in
>>> sequences of rows"? More specifically, I'm referring to this:
>>> http://web.cs.ucla.edu/classes/fall15/cs240A/notes/temporal/row-pattern-recogniton-11.pdf
>>>
>>> This is a very cool/useful feature for pattern matching over live
>>> stream/archived data. It is sort of related to machine learning because
>>> it is usually used in clickstream analysis or path analysis. It is also
>>> related to streaming because of the nature of the processing (mostly time
>>> series data). It is SQL because there is a good way to express and optimize
>>> the query.
>>>
>>> Best Regards,
>>>
>>> Jerry
>>>
>>
>>
>
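
For readers unfamiliar with nPath, the semantics of the query quoted above
can be approximated in plain Python. This is only a rough sketch, not how
Teradata, Spark, or Flink implement it. It assumes clicks arrive as dicts
with the keys sessionid, ts, pageid and category (taken from the query), and
it assumes that OVERLAPPING mode can be modelled as one greedy match starting
at every row that satisfies A. The function names are hypothetical.

```python
from itertools import groupby
from operator import itemgetter


# Symbol predicates, mirroring the SYMBOLS clause of the quoted query.
def is_a(row):
    return row["pageid"] == 50


def is_b(row):
    return row["pageid"] == 80


def is_c(row):
    return row["pageid"] != 80 and row["category"] in (9, 10)


def match_paths(clicks):
    """Yield every click path matching 'A.(B|C)*' within a session."""
    # ON (SELECT * FROM clicks WHERE category >= 0)
    rows = [r for r in clicks if r["category"] >= 0]
    # PARTITION BY sessionid ORDER BY ts
    rows.sort(key=itemgetter("sessionid", "ts"))
    for _, session in groupby(rows, key=itemgetter("sessionid")):
        session = list(session)
        # Assumption: OVERLAPPING means a match may start at any A row,
        # even one that lies inside an earlier match.
        for start, row in enumerate(session):
            if not is_a(row):
                continue
            end = start + 1
            # Greedily consume rows matching (B|C)*.
            while end < len(session) and (
                is_b(session[end]) or is_c(session[end])
            ):
                end += 1
            yield session[start:end]


def max_page80_by_last_page(clicks, min_length=5):
    """Return (last_pageid, max count_page80) pairs, sorted by the count."""
    best = {}
    for path in match_paths(clicks):
        # WHERE count_any >= 5: every row in a path matched A, B or C.
        if len(path) < min_length:
            continue
        last_pageid = path[-1]["pageid"]
        count_page80 = sum(1 for r in path[1:] if is_b(r))
        # GROUP BY last_pageid with MAX(count_page80)
        best[last_pageid] = max(best.get(last_pageid, 0), count_page80)
    # ORDER BY MAX(count_page80)
    return sorted(best.items(), key=itemgetter(1))
```

Even in this simplified form, the partitioning, ordering, symbol mapping and
aggregation each need explicit code, which is the work the PATTERN and
SYMBOLS clauses express declaratively.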
