spark-dev mailing list archives

From Jerry Lam <chiling...@gmail.com>
Subject Re: SPARK-SQL: Pattern Detection on Live Event or Archived Event Data
Date Tue, 01 Mar 2016 17:07:25 GMT
Hi Alex,

We already went down this path :) which is why we are trying other
approaches. The recursion makes it very inefficient in some cases.
This paper describes the details very well:
https://people.cs.umass.edu/%7Eyanlei/publications/sase-sigmod08.pdf
It is the same paper referenced in the Flink ticket.
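
For anyone following along, the recursive matcher quoted below can be made self-contained roughly as follows. This is only a sketch: `PageView` and `Session` are not defined anywhere in the thread, so the two case classes here are assumptions, and I use `List` for the input so that the `::` patterns apply directly. It illustrates the shape of the recursion and is not tuned for long sessions.

```scala
// Assumed types: the quoted snippet does not define PageView or Session.
final case class PageView(ts: Long, url: String)
final case class Session[A](id: String, events: Seq[A])

object RecursiveMatch {
  // Walks the page views of one user. Each time the homepage is followed
  // directly by the products landing page more than 600 time units later,
  // a Session holding the views from that position onward is emitted.
  // `h` is carried through unchanged, as in the quoted snippet.
  def matchSessions(h: Seq[Session[PageView]], id: String,
                    p: List[PageView]): Seq[Session[PageView]] =
    p match {
      case Nil => Nil
      case PageView(ts1, "company.com>homepage") ::
           PageView(ts2, "company.com>plus>products landing") :: tail
          if ts2 > ts1 + 600 =>
        // Continue after the matched pair and prepend this match.
        matchSessions(h, id, tail).+:(Session(id, p))
      case _ =>
        // No match at this position: drop one view and retry.
        matchSessions(h, id, p.tail)
    }
}
```

Calling `RecursiveMatch.matchSessions(Nil, "s1", views)` on a list that starts with a homepage view followed by a landing-page view more than 600 units later returns a single `Session` wrapping `views`.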

Please let me know if I overlooked something. Thank you for sharing this!

Best Regards,

Jerry




On Tue, Mar 1, 2016 at 11:58 AM, Alex Kozlov <alexvk@gmail.com> wrote:

> For the purpose of full disclosure, I think Scala offers a much more
> efficient pattern matching paradigm.  Using nPath is like using assembler
> to program distributed systems.  I cannot say much here today, but the
> pattern would look like:
>
>     def matchSessions(h: Seq[Session[PageView]], id: String,
>                       p: Seq[PageView]): Seq[Session[PageView]] = {
>       p match {
>         case Nil => Nil
>         // homepage followed by the products landing page, more than 600 later
>         case PageView(ts1, "company.com>homepage") ::
>              PageView(ts2, "company.com>plus>products landing") :: tail
>             if ts2 > ts1 + 600 =>
>           matchSessions(h, id, tail).+:(new Session(id, p))
>         // no match at this position: drop one page view and retry
>         case _ => matchSessions(h, id, p.tail)
>       }
>     }
>
> Look for Scala case statements with guards and upcoming book releases.
>
> http://docs.scala-lang.org/tutorials/tour/pattern-matching
>
> https://www.safaribooksonline.com/library/view/scala-cookbook/9781449340292/ch03s14.html
>
> On Tue, Mar 1, 2016 at 8:34 AM, Henri Dubois-Ferriere <henridf@gmail.com>
> wrote:
>
>> fwiw Apache Flink just added CEP. Queries are constructed
>> programmatically rather than in SQL, but the underlying functionality is
>> similar.
>>
>> https://issues.apache.org/jira/browse/FLINK-3215
>>
>> On 1 March 2016 at 08:19, Jerry Lam <chilinglam@gmail.com> wrote:
>>
>>> Hi Herman,
>>>
>>> Thank you for your reply!
>>> This functionality usually finds its place in financial services which
>>> use CEP (complex event processing) for correlation and pattern matching.
>>> Many commercial products have this including Oracle and Teradata Aster Data
>>> MR Analytics. I do agree the syntax is a bit awkward, but after you understand
>>> it, it is actually very compact for expressing something that is very
>>> complex. Esper has this feature partially implemented (
>>> http://www.espertech.com/esper/release-5.1.0/esper-reference/html/match-recognize.html
>>> ).
>>>
>>> I found that the Teradata Analytics documentation describes its usage
>>> best. For example (note that nPath is similar to match_recognize):
>>>
>>> SELECT last_pageid, MAX( count_page80 )
>>>  FROM nPath(
>>>  ON ( SELECT * FROM clicks WHERE category >= 0 )
>>>  PARTITION BY sessionid
>>>  ORDER BY ts
>>>  PATTERN ( 'A.(B|C)*' )
>>>  MODE ( OVERLAPPING )
>>>  SYMBOLS ( pageid = 50 AS A,
>>>            pageid = 80 AS B,
>>>            pageid <> 80 AND category IN (9,10) AS C )
>>>  RESULT ( LAST ( pageid OF ANY ( A,B,C ) ) AS last_pageid,
>>>           COUNT ( * OF B ) AS count_page80,
>>>           COUNT ( * OF ANY ( A,B,C ) ) AS count_any )
>>>  )
>>>  WHERE count_any >= 5
>>>  GROUP BY last_pageid
>>>  ORDER BY MAX( count_page80 )
>>>
>>> The above means:
>>> Find user click-paths starting at pageid 50 and passing exclusively
>>> through either pageid 80 or pages in category 9 or category 10. Find the
>>> pageid of the last page in the path and count the number of times page 80
>>> was visited. Report the maximum count for each last page, and sort the
>>> output by the latter. Restrict to paths containing at least 5 pages. Ignore
>>> pages in the sequence with category < 0.
>>>
>>> If this query were written in pure SQL (if that is possible at all), it
>>> would require several self-joins. The interesting thing about this feature
>>> is that it integrates SQL+Streaming+ML in one (and perhaps graph too).
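
To make the English description of the query above concrete, here is a rough in-memory Scala sketch of the same logic for readers who do not know nPath. It is not how nPath or any engine evaluates the query. `Click`, `PathMatch` and `NPathSketch` are made-up names, and I am assuming that MODE ( OVERLAPPING ) yields one greedy match starting at every row that satisfies A.

```scala
// Hypothetical row and result types; the column names follow the query.
final case class Click(ts: Long, pageid: Int, category: Int)
final case class PathMatch(lastPageid: Int, countPage80: Int, countAny: Int)

object NPathSketch {
  // SYMBOLS clause of the query.
  private def isA(c: Click): Boolean = c.pageid == 50
  private def isB(c: Click): Boolean = c.pageid == 80
  private def isC(c: Click): Boolean =
    c.pageid != 80 && Set(9, 10).contains(c.category)

  // Pattern 'A.(B|C)*' over one session (one PARTITION BY group).
  def matches(session: Seq[Click]): Seq[PathMatch] = {
    // ON ( ... WHERE category >= 0 ) followed by ORDER BY ts.
    val rows = session.filter(_.category >= 0).sortBy(_.ts).toVector
    rows.indices.filter(i => isA(rows(i))).map { i =>
      // Greedily extend the path while rows satisfy B or C.
      val rest = rows.drop(i + 1).takeWhile(c => isB(c) || isC(c))
      val path = rows(i) +: rest
      PathMatch(path.last.pageid, path.count(isB), path.size)
    }
  }

  // Outer query: WHERE count_any >= 5, GROUP BY last_pageid, MAX(count_page80).
  // The final ORDER BY is left to the caller.
  def maxPage80ByLastPage(sessions: Iterable[Seq[Click]]): Map[Int, Int] =
    sessions.toSeq
      .flatMap(s => matches(s))
      .filter(_.countAny >= 5)
      .groupBy(_.lastPageid)
      .map { case (last, ms) => last -> ms.map(_.countPage80).max }
}
```

Each session is scanned once per starting row, so the sketch only shows what the query computes, not how to compute it efficiently.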
>>>
>>> Best Regards,
>>>
>>> Jerry
>>>
>>>
>>> On Tue, Mar 1, 2016 at 9:39 AM, Herman van Hövell tot Westerflier <
>>> hvanhovell@questtec.nl> wrote:
>>>
>>>> Hi Jerry,
>>>>
>>>> This is not on any roadmap. I (briefly) browsed through this, and it
>>>> looks like some sort of window function with very awkward syntax. I think
>>>> Spark provides better constructs for this using dataframes/datasets/nested
>>>> data...
>>>>
>>>> Feel free to submit a PR.
>>>>
>>>> Kind regards,
>>>>
>>>> Herman van Hövell
>>>>
>>>> 2016-03-01 15:16 GMT+01:00 Jerry Lam <chilinglam@gmail.com>:
>>>>
>>>>> Hi Spark developers,
>>>>>
>>>>> Would you consider adding support for "Pattern matching in
>>>>> sequences of rows"? More specifically, I'm referring to this:
>>>>> http://web.cs.ucla.edu/classes/fall15/cs240A/notes/temporal/row-pattern-recogniton-11.pdf
>>>>>
>>>>> This is a very cool/useful feature for pattern matching over live
>>>>> stream/archived data. It is sort of related to machine learning because
>>>>> it is usually used in clickstream analysis or path analysis. It is also
>>>>> related to streaming because of the nature of the processing (mostly time
>>>>> series data). It is SQL because there is a good way to express and optimize
>>>>> the query.
>>>>>
>>>>> Best Regards,
>>>>>
>>>>> Jerry
>>>>>
>>>>
>>>>
>>>
>>
>
>
> --
> Alex Kozlov
> (408) 507-4987
> (650) 887-2135 efax
> alexvk@gmail.com
>
