In the interest of full disclosure, I think Scala offers a much more efficient pattern matching paradigm.  Using nPath is like using assembler to program distributed systems.  I can't say much here today, but the pattern would look like:

    def matchSessions(h: Seq[Session[PageView]], id: String,
                      p: Seq[PageView]): Seq[Session[PageView]] = {
      p match {
        case Nil => Nil
        case PageView(ts1, "company.com>homepage") ::
             PageView(ts2, "company.com>plus>products landing") :: tail
             if ts2 > ts1 + 600 =>
          matchSessions(h, id, tail).+:(new Session(id, p))
        case _ => matchSessions(h, id, p.tail)
      }
    }
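
For reference, the snippet assumes roughly the following supporting types (my guesses, purely illustrative), and would be called per visitor over their time-ordered page views:

    // Assumed shapes, not part of the original snippet:
    case class PageView(ts: Long, page: String)          // ts in seconds
    class Session[A](val id: String, val hits: Seq[A])

    // e.g. matchSessions(Nil, "visitor-42", orderedPageViews)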

Look up Scala case statements with guards, and watch for upcoming book releases:

http://docs.scala-lang.org/tutorials/tour/pattern-matching
https://www.safaribooksonline.com/library/view/scala-cookbook/9781449340292/ch03s14.html

On Tue, Mar 1, 2016 at 8:34 AM, Henri Dubois-Ferriere <henridf@gmail.com> wrote:
fwiw Apache Flink just added CEP. Queries are constructed programmatically rather than in SQL, but the underlying functionality is similar.

https://issues.apache.org/jira/browse/FLINK-3215

On 1 March 2016 at 08:19, Jerry Lam <chilinglam@gmail.com> wrote:
Hi Herman,

Thank you for your reply!
This functionality usually finds its place in financial services, which use CEP (complex event processing) for correlation and pattern matching. Many commercial products have it, including Oracle and Teradata Aster Data MR Analytics. I do agree the syntax is a bit awkward, but once you understand it, it is actually very compact for expressing something that is very complex. Esper has this feature partially implemented (http://www.espertech.com/esper/release-5.1.0/esper-reference/html/match-recognize.html).

I found the Teradata Analytics documentation describes the usage best. For example (note that nPath is similar to MATCH_RECOGNIZE):

SELECT last_pageid, MAX( count_page80 )
 FROM nPath(
 ON ( SELECT * FROM clicks WHERE category >= 0 )
 PARTITION BY sessionid
 ORDER BY ts
 PATTERN ( 'A.(B|C)*' )
 MODE ( OVERLAPPING )
 SYMBOLS ( pageid = 50 AS A,
           pageid = 80 AS B,
           pageid <> 80 AND category IN (9,10) AS C )
 RESULT ( LAST ( pageid OF ANY ( A,B,C ) ) AS last_pageid,
          COUNT ( * OF B ) AS count_page80,
          COUNT ( * OF ANY ( A,B,C ) ) AS count_any )
 )
 WHERE count_any >= 5
 GROUP BY last_pageid
 ORDER BY MAX( count_page80 )

The above means:
Find user click-paths starting at pageid 50 and passing exclusively through either pageid 80 or pages in category 9 or category 10. Find the pageid of the last page in the path and count the number of times page 80 was visited. Report the maximum count for each last page, and sort the output by the latter. Restrict to paths containing at least 5 pages. Ignore pages in the sequence with category < 0.
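
To make the SYMBOLS/PATTERN semantics concrete, here is a rough, self-contained Scala sketch of what that check amounts to for a single session's ordered clicks (just an illustration, not how nPath is implemented):

    case class Click(ts: Long, pageid: Int, category: Int)

    // SYMBOLS: classify each click as A, B or C (None if it matches no symbol).
    def symbol(c: Click): Option[Char] =
      if (c.pageid == 50) Some('A')
      else if (c.pageid == 80) Some('B')
      else if (c.pageid != 80 && Set(9, 10).contains(c.category)) Some('C')
      else None

    // PATTERN ('A.(B|C)*'): start at A, then pass exclusively through B or C.
    // Encoding the clicks as symbols reduces the check to an ordinary regex match.
    def matchesPattern(path: Seq[Click]): Boolean = {
      val symbols = path.map(symbol)
      symbols.forall(_.isDefined) && symbols.flatten.mkString.matches("A[BC]*")
    }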

If this query were written in pure SQL (if that is possible at all), it would require several self-joins. The interesting thing about this feature is that it integrates SQL + streaming + ML in one (and potentially graph as well).

Best Regards,

Jerry


On Tue, Mar 1, 2016 at 9:39 AM, Herman van Hövell tot Westerflier <hvanhovell@questtec.nl> wrote:
Hi Jerry,

This is not on any roadmap. I (briefly) browsed through this; it looks like some sort of window function with very awkward syntax. I think Spark provides better constructs for this using DataFrames/Datasets/nested data...
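
A rough sketch of the kind of construct I mean (assuming, say, a clicks table with sessionid, ts and pageid columns; not a worked-out design):

    // In spark-shell (Spark 1.6-era) sqlContext is already defined;
    // otherwise create one from the SparkContext.
    import org.apache.spark.sql.functions._
    import sqlContext.implicits._

    val sessions = sqlContext.table("clicks")
      .where($"category" >= 0)
      .groupBy($"sessionid")
      .agg(collect_list(struct($"ts", $"pageid")).as("path"))

    // Each row now carries one session's (ts, pageid) pairs as nested data;
    // sort by ts and pattern-match over the resulting sequence on the Scala
    // side (collect_list itself does not guarantee ordering).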

Feel free to submit a PR.

Kind regards,

Herman van Hövell

2016-03-01 15:16 GMT+01:00 Jerry Lam <chilinglam@gmail.com>:
Hi Spark developers,

Will you consider to add support for implementing "Pattern matching in sequences of rows"? More specifically, I'm referring to this: http://web.cs.ucla.edu/classes/fall15/cs240A/notes/temporal/row-pattern-recogniton-11.pdf

This is a very cool/useful feature for pattern matching over live stream/archived data. It is sort of related to machine learning because it is usually used in clickstream analysis or path analysis. It is also related to streaming because of the nature of the processing (mostly time series data). And it is SQL because there is a good way to express and optimize the query.

Best Regards,

Jerry


--
Alex Kozlov
(408) 507-4987
(650) 887-2135 efax
alexvk@gmail.com