spark-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Alex Kozlov <ale...@gmail.com>
Subject Re: SPARK-SQL: Pattern Detection on Live Event or Archived Event Data
Date Tue, 01 Mar 2016 16:58:01 GMT
For the purpose of full disclosure, I think Scala offers a much more
efficient pattern matching paradigm.  Using nPath is like using assembler
to program distributed systems.  Cannot tell much here today, but the
pattern would look like:

     |     def matchSessions(h: Seq[Session[PageView]], id: String, p:
Seq[PageView]) :

Seq[Session[PageView]] = {    |       p match {

     |         case Nil => Nil

     |         case PageView(ts1, "company.com>homepage") :: PageView(ts2,

"company.com>plus>products landing") :: tail if ts2 > ts1 + 600 =>

     |           matchSessions(h, id, tail).+:(new Session(id, p))

     |         case _ => matchSessions(h, id, p.tail)

     |       }

Look for Scala case statements with guards and upcoming book releases.

http://docs.scala-lang.org/tutorials/tour/pattern-matching
https://www.safaribooksonline.com/library/view/scala-cookbook/9781449340292/ch03s14.html

On Tue, Mar 1, 2016 at 8:34 AM, Henri Dubois-Ferriere <henridf@gmail.com>
wrote:

> fwiw Apache Flink just added CEP. Queries are constructed programmatically
> rather than in SQL, but the underlying functionality is similar.
>
> https://issues.apache.org/jira/browse/FLINK-3215
>
> On 1 March 2016 at 08:19, Jerry Lam <chilinglam@gmail.com> wrote:
>
>> Hi Herman,
>>
>> Thank you for your reply!
>> This functionality usually finds its place in financial services which
>> use CEP (complex event processing) for correlation and pattern matching.
>> Many commercial products have this including Oracle and Teradata Aster Data
>> MR Analytics. I do agree the syntax a bit awkward but after you understand
>> it, it is actually very compact for expressing something that is very
>> complex. Esper has this feature partially implemented (
>> http://www.espertech.com/esper/release-5.1.0/esper-reference/html/match-recognize.html
>> ).
>>
>> I found the Teradata Analytics documentation best to describe the usage
>> of it. For example (note npath is similar to match_recognize):
>>
>> SELECT last_pageid, MAX( count_page80 )
>>  FROM nPath(
>>  ON ( SELECT * FROM clicks WHERE category >= 0 )
>>  PARTITION BY sessionid
>>  ORDER BY ts
>>  PATTERN ( 'A.(B|C)*' )
>>  MODE ( OVERLAPPING )
>>  SYMBOLS ( pageid = 50 AS A,
>>            pageid = 80 AS B,
>>            pageid <> 80 AND category IN (9,10) AS C )
>>  RESULT ( LAST ( pageid OF ANY ( A,B,C ) ) AS last_pageid,
>>           COUNT ( * OF B ) AS count_page80,
>>           COUNT ( * OF ANY ( A,B,C ) ) AS count_any )
>>  )
>>  WHERE count_any >= 5
>>  GROUP BY last_pageid
>>  ORDER BY MAX( count_page80 )
>>
>> The above means:
>> Find user click-paths starting at pageid 50 and passing exclusively
>> through either pageid 80 or pages in category 9 or category 10. Find the
>> pageid of the last page in the path and count the number of times page 80
>> was visited. Report the maximum count for each last page, and sort the
>> output by the latter. Restrict to paths containing at least 5 pages. Ignore
>> pages in the sequence with category < 0.
>>
>> If this query is written in pure SQL (if possible at all), it requires
>> several self-joins. The interesting thing about this feature is that it
>> integrates SQL+Streaming+ML in one (perhaps potentially graph too).
>>
>> Best Regards,
>>
>> Jerry
>>
>>
>> On Tue, Mar 1, 2016 at 9:39 AM, Herman van Hövell tot Westerflier <
>> hvanhovell@questtec.nl> wrote:
>>
>>> Hi Jerry,
>>>
>>> This is not on any roadmap. I (shortly) browsed through this; and this
>>> looks like some sort of a window function with very awkward syntax. I think
>>> spark provided better constructs for this using dataframes/datasets/nested
>>> data...
>>>
>>> Feel free to submit a PR.
>>>
>>> Kind regards,
>>>
>>> Herman van Hövell
>>>
>>> 2016-03-01 15:16 GMT+01:00 Jerry Lam <chilinglam@gmail.com>:
>>>
>>>> Hi Spark developers,
>>>>
>>>> Will you consider to add support for implementing "Pattern matching in
>>>> sequences of rows"? More specifically, I'm referring to this:
>>>> http://web.cs.ucla.edu/classes/fall15/cs240A/notes/temporal/row-pattern-recogniton-11.pdf
>>>>
>>>> This is a very cool/useful feature to pattern matching over live
>>>> stream/archived data. It is sorted of related to machine learning because
>>>> this is usually used in clickstream analysis or path analysis. Also it is
>>>> related to streaming because of the nature of the processing (time series
>>>> data mostly). It is SQL because there is a good way to express and optimize
>>>> the query.
>>>>
>>>> Best Regards,
>>>>
>>>> Jerry
>>>>
>>>
>>>
>>
>


-- 
Alex Kozlov
(408) 507-4987
(650) 887-2135 efax
alexvk@gmail.com

Mime
View raw message