spark-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Alex Kozlov <ale...@gmail.com>
Subject Re: SPARK-SQL: Pattern Detection on Live Event or Archived Event Data
Date Tue, 01 Mar 2016 17:35:57 GMT
Looked at the paper: while we can argue on the performance side, I think
semantically the Scala pattern matching is much more expressive.  The time
will decide.

On Tue, Mar 1, 2016 at 9:07 AM, Jerry Lam <chilinglam@gmail.com> wrote:

> Hi Alex,
>
> We went through this path already :) This is the reason we try other
> approaches. The recursion makes it very inefficient for some cases.
> For details, this paper describes it very well:
> https://people.cs.umass.edu/%7Eyanlei/publications/sase-sigmod08.pdf
> which is the same paper references in Flink ticket.
>
> Please let me know if I overlook something. Thank you for sharing this!
>
> Best Regards,
>
> Jerry
>
> On Tue, Mar 1, 2016 at 11:58 AM, Alex Kozlov <alexvk@gmail.com> wrote:
>
>> For the purpose of full disclosure, I think Scala offers a much more
>> efficient pattern matching paradigm.  Using nPath is like using assembler
>> to program distributed systems.  Cannot tell much here today, but the
>> pattern would look like:
>>
>>      |     def matchSessions(h: Seq[Session[PageView]], id: String, p:
>> Seq[PageView]) :
>>
>> Seq[Session[PageView]] = {    |       p match {
>>
>>      |         case Nil => Nil
>>
>>      |         case PageView(ts1, "company.com>homepage") ::
>> PageView(ts2,
>>
>> "company.com>plus>products landing") :: tail if ts2 > ts1 + 600 =>
>>
>>      |           matchSessions(h, id, tail).+:(new Session(id, p))
>>
>>      |         case _ => matchSessions(h, id, p.tail)
>>
>>      |       }
>>
>> Look for Scala case statements with guards and upcoming book releases.
>>
>> http://docs.scala-lang.org/tutorials/tour/pattern-matching
>>
>> https://www.safaribooksonline.com/library/view/scala-cookbook/9781449340292/ch03s14.html
>>
>> On Tue, Mar 1, 2016 at 8:34 AM, Henri Dubois-Ferriere <henridf@gmail.com>
>> wrote:
>>
>>> fwiw Apache Flink just added CEP. Queries are constructed
>>> programmatically rather than in SQL, but the underlying functionality is
>>> similar.
>>>
>>> https://issues.apache.org/jira/browse/FLINK-3215
>>>
>>> On 1 March 2016 at 08:19, Jerry Lam <chilinglam@gmail.com> wrote:
>>>
>>>> Hi Herman,
>>>>
>>>> Thank you for your reply!
>>>> This functionality usually finds its place in financial services which
>>>> use CEP (complex event processing) for correlation and pattern matching.
>>>> Many commercial products have this including Oracle and Teradata Aster Data
>>>> MR Analytics. I do agree the syntax a bit awkward but after you understand
>>>> it, it is actually very compact for expressing something that is very
>>>> complex. Esper has this feature partially implemented (
>>>> http://www.espertech.com/esper/release-5.1.0/esper-reference/html/match-recognize.html
>>>> ).
>>>>
>>>> I found the Teradata Analytics documentation best to describe the usage
>>>> of it. For example (note npath is similar to match_recognize):
>>>>
>>>> SELECT last_pageid, MAX( count_page80 )
>>>>  FROM nPath(
>>>>  ON ( SELECT * FROM clicks WHERE category >= 0 )
>>>>  PARTITION BY sessionid
>>>>  ORDER BY ts
>>>>  PATTERN ( 'A.(B|C)*' )
>>>>  MODE ( OVERLAPPING )
>>>>  SYMBOLS ( pageid = 50 AS A,
>>>>            pageid = 80 AS B,
>>>>            pageid <> 80 AND category IN (9,10) AS C )
>>>>  RESULT ( LAST ( pageid OF ANY ( A,B,C ) ) AS last_pageid,
>>>>           COUNT ( * OF B ) AS count_page80,
>>>>           COUNT ( * OF ANY ( A,B,C ) ) AS count_any )
>>>>  )
>>>>  WHERE count_any >= 5
>>>>  GROUP BY last_pageid
>>>>  ORDER BY MAX( count_page80 )
>>>>
>>>> The above means:
>>>> Find user click-paths starting at pageid 50 and passing exclusively
>>>> through either pageid 80 or pages in category 9 or category 10. Find the
>>>> pageid of the last page in the path and count the number of times page 80
>>>> was visited. Report the maximum count for each last page, and sort the
>>>> output by the latter. Restrict to paths containing at least 5 pages. Ignore
>>>> pages in the sequence with category < 0.
>>>>
>>>> If this query is written in pure SQL (if possible at all), it requires
>>>> several self-joins. The interesting thing about this feature is that it
>>>> integrates SQL+Streaming+ML in one (perhaps potentially graph too).
>>>>
>>>> Best Regards,
>>>>
>>>> Jerry
>>>>
>>>>
>>>> On Tue, Mar 1, 2016 at 9:39 AM, Herman van Hövell tot Westerflier <
>>>> hvanhovell@questtec.nl> wrote:
>>>>
>>>>> Hi Jerry,
>>>>>
>>>>> This is not on any roadmap. I (shortly) browsed through this; and this
>>>>> looks like some sort of a window function with very awkward syntax. I
think
>>>>> spark provided better constructs for this using dataframes/datasets/nested
>>>>> data...
>>>>>
>>>>> Feel free to submit a PR.
>>>>>
>>>>> Kind regards,
>>>>>
>>>>> Herman van Hövell
>>>>>
>>>>> 2016-03-01 15:16 GMT+01:00 Jerry Lam <chilinglam@gmail.com>:
>>>>>
>>>>>> Hi Spark developers,
>>>>>>
>>>>>> Will you consider to add support for implementing "Pattern matching
>>>>>> in sequences of rows"? More specifically, I'm referring to this:
>>>>>> http://web.cs.ucla.edu/classes/fall15/cs240A/notes/temporal/row-pattern-recogniton-11.pdf
>>>>>>
>>>>>> This is a very cool/useful feature to pattern matching over live
>>>>>> stream/archived data. It is sorted of related to machine learning
because
>>>>>> this is usually used in clickstream analysis or path analysis. Also
it is
>>>>>> related to streaming because of the nature of the processing (time
series
>>>>>> data mostly). It is SQL because there is a good way to express and
optimize
>>>>>> the query.
>>>>>>
>>>>>> Best Regards,
>>>>>>
>>>>>> Jerry
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>>
>> --
>> Alex Kozlov
>> (408) 507-4987
>> (650) 887-2135 efax
>> alexvk@gmail.com
>>
>
>


-- 
Alex Kozlov
(408) 507-4987
(650) 887-2135 efax
alexvk@gmail.com

Mime
View raw message