samza-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jay Kreps <jay.kr...@gmail.com>
Subject Re: Using Samza to build a CEP (Complex Events Processing) system?
Date Mon, 02 Sep 2013 15:20:29 GMT
Hi Tim,

What we have in the way of documentation on storage is here:
http://samza.incubator.apache.org/learn/documentation/0.7.0/container/state-management.html

It's a bit sparse. If you help me understand your specific questions I will
make sure I cover them.

-Jay


On Sat, Aug 31, 2013 at 1:58 PM, Timothy Chen <tnachen@gmail.com> wrote:

> Hi,
>
> I wonder if there are details about the new key/value store Samza
> provides? Especially the design and how it handles scale, consistency
> guarantees etc.
>
> Tim
>
> On Aug 31, 2013, at 12:56 PM, Alex The Rocker <alex.m3tal@gmail.com>
> wrote:
>
> > Chris,
> >
> > Thanks you very much for your detailed.
> > Another system for processing real-time data just came to my attention
> > (thanks to Kafka mailing list, again).
> > It's called Druid (more at: http://druid.io).
> >
> > While I now understand Samza advantages over Storm for building a CEP, I
> am
> > wondering how Samza compares to Druid.
> > I guess I may not alone wondering about Samza vs. Druid, so you may want
> to
> > add a Samza vs. Druid" item in Samza documenation :)
> >
> > Thanks,
> > Alex.
> >
> >
> >
> >
> > On Sun, Aug 25, 2013 at 5:26 PM, Chris Riccomini <
> criccomini@linkedin.com>wrote:
> >
> >> Hey Alex,
> >>
> >> As I understand it, the CEP pattern you describing is, "look for a
> series
> >> of events within some bounded time frame, and take an action based on
> the
> >> combination of events." You use an example of three events arriving
> within
> >> 10 minutes of each other, consecutively. Wikipedia uses a similar
> example
> >> (wedding bell event + man in suit event + woman in white dress event +
> >> rice thrown event = wedding) on their CEP page.
> >>
> >> This pattern can be implemented in Samza fairly easily using Samza's
> >> key/value store (or some other StorageEngine, if you choose to implement
> >> it). It's best to use a key/value store for this use case, since the
> >> window might be quite long (10 minutes), and all events in the window
> >> might not fit in memory. If you use Samza's key/value store, you can put
> >> each message (and a timestamp) into the key/value store as the messages
> >> arrive. You can then implement the WindowableTask interface along with
> the
> >> StreamTask interface, and configure Samza to call window() on your task
> >> every N seconds (say, task.window.ms=60000). The window method could
> then
> >> do a range query on the key/value store, and check for message chains
> >> (e.g. E1 -> E2 -> E3) that were last updated > 10 minutes ago. If an
> >> expected message was missing, you could then take some action (send an
> >> alert, or whatever).
> >>
> >> In general, when I think CEP, I think Esper (http://esper.codehaus.org/
> ).
> >> You should be able to implement a lot of CEP/SQL type commands (SELECT,
> >> JOIN, COUNT, SUM, DISTINCT, WHERE, GROUP BY, HAVING, WINDOW, ORDER, etc)
> >> using Samza's StreamTask interface, and is state management facilities.
> >>
> >> Beyond state management, most features in Samza enable CEP processing,
> in
> >> one way or another. From your perspective, you can look at Samza as the
> >> underlying framework with which you might choose to implement a CEP type
> >> system (think MapReduce is to Hive as Samza is to a CEP system).
> Specific
> >> things that help are its WindowableTask interface, the partitioning
> model
> >> (which lends itself to distributed joins and aggregation), and Samza's
> >> state management features.
> >>
> >> One thing to be aware of right now is Samza's "at least once" messaging
> >> guarantee when failures occur (inherited from Kafka). You might receive
> >> duplicate messages. This means you can potentially double count, if
> you're
> >> doing aggregation. In the example you give (E1, E2, E3), this shouldn┬╣t
> be
> >> a problem. We have plans to provide exactly once messaging, but we
> haven't
> >> implemented the feature yet.
> >>
> >> Cheers,
> >> Chris
> >>
> >> On 8/24/13 12:05 PM, "Alex The Rocker" <alex.m3tal@gmail.com> wrote:
> >>
> >>> Hello,
> >>>
> >>> I just began to read about Samza, and I very excited about it (I was
> >>> warned
> >>> of its existence by Jay Kreps' post in Kafka users list, BTW).
> >>>
> >>> My first reaction is: are you guys using it at LinkedIn for
> applications
> >>> which lies in the CEP (Complex Event Processing) system domain?
> >>>
> >>> To be more specific, would stateful Samza tasks be used in order to
> >>> compute
> >>> complex states such as "event E1 is followed by E2 then by E3 with less
> >>> than 10 minutes interval between each event" ?
> >>>
> >>> I was looking at Storm for CEP, but as pointed out in Samza Storm page,
> >>> Storm leaves state management to the bolts code, whereas Samza has
> >>> "something".
> >>>
> >>> Beyond state management, what else would make Samza a good building
> block
> >>> for a CEP?  Or a bad one?
> >>>
> >>> Thanks,
> >>> Alex.
> >>
> >>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message