samza-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Alex The Rocker <alex.m3...@gmail.com>
Subject Re: Using Samza to build a CEP (Complex Events Processing) system?
Date Wed, 04 Sep 2013 07:06:12 GMT
Hi,

Thanks all for answers and sub-topics.
One sub-topic that puzzles me (still with the goal of building CEP) is how
Samza key/value store scales.
I have read couple of scary stories about project using Zookeeper reaching
some limits, so I think the question is worth.
How large Samza key/value store can be?
How its size impacts overall performances?
Is there some housekeeping to do, for example to clean orphan key/values?

Thanks,
Alex.



On Mon, Sep 2, 2013 at 5:20 PM, Jay Kreps <jay.kreps@gmail.com> wrote:

> Hi Tim,
>
> What we have in the way of documentation on storage is here:
>
> http://samza.incubator.apache.org/learn/documentation/0.7.0/container/state-management.html
>
> It's a bit sparse. If you help me understand your specific questions I will
> make sure I cover them.
>
> -Jay
>
>
> On Sat, Aug 31, 2013 at 1:58 PM, Timothy Chen <tnachen@gmail.com> wrote:
>
> > Hi,
> >
> > I wonder if there are details about the new key/value store Samza
> > provides? Especially the design and how it handles scale, consistency
> > guarantees etc.
> >
> > Tim
> >
> > On Aug 31, 2013, at 12:56 PM, Alex The Rocker <alex.m3tal@gmail.com>
> > wrote:
> >
> > > Chris,
> > >
> > > Thanks you very much for your detailed.
> > > Another system for processing real-time data just came to my attention
> > > (thanks to Kafka mailing list, again).
> > > It's called Druid (more at: http://druid.io).
> > >
> > > While I now understand Samza advantages over Storm for building a CEP,
> I
> > am
> > > wondering how Samza compares to Druid.
> > > I guess I may not alone wondering about Samza vs. Druid, so you may
> want
> > to
> > > add a Samza vs. Druid" item in Samza documenation :)
> > >
> > > Thanks,
> > > Alex.
> > >
> > >
> > >
> > >
> > > On Sun, Aug 25, 2013 at 5:26 PM, Chris Riccomini <
> > criccomini@linkedin.com>wrote:
> > >
> > >> Hey Alex,
> > >>
> > >> As I understand it, the CEP pattern you describing is, "look for a
> > series
> > >> of events within some bounded time frame, and take an action based on
> > the
> > >> combination of events." You use an example of three events arriving
> > within
> > >> 10 minutes of each other, consecutively. Wikipedia uses a similar
> > example
> > >> (wedding bell event + man in suit event + woman in white dress event +
> > >> rice thrown event = wedding) on their CEP page.
> > >>
> > >> This pattern can be implemented in Samza fairly easily using Samza's
> > >> key/value store (or some other StorageEngine, if you choose to
> implement
> > >> it). It's best to use a key/value store for this use case, since the
> > >> window might be quite long (10 minutes), and all events in the window
> > >> might not fit in memory. If you use Samza's key/value store, you can
> put
> > >> each message (and a timestamp) into the key/value store as the
> messages
> > >> arrive. You can then implement the WindowableTask interface along with
> > the
> > >> StreamTask interface, and configure Samza to call window() on your
> task
> > >> every N seconds (say, task.window.ms=60000). The window method could
> > then
> > >> do a range query on the key/value store, and check for message chains
> > >> (e.g. E1 -> E2 -> E3) that were last updated > 10 minutes ago.
If an
> > >> expected message was missing, you could then take some action (send an
> > >> alert, or whatever).
> > >>
> > >> In general, when I think CEP, I think Esper (
> http://esper.codehaus.org/
> > ).
> > >> You should be able to implement a lot of CEP/SQL type commands
> (SELECT,
> > >> JOIN, COUNT, SUM, DISTINCT, WHERE, GROUP BY, HAVING, WINDOW, ORDER,
> etc)
> > >> using Samza's StreamTask interface, and is state management
> facilities.
> > >>
> > >> Beyond state management, most features in Samza enable CEP processing,
> > in
> > >> one way or another. From your perspective, you can look at Samza as
> the
> > >> underlying framework with which you might choose to implement a CEP
> type
> > >> system (think MapReduce is to Hive as Samza is to a CEP system).
> > Specific
> > >> things that help are its WindowableTask interface, the partitioning
> > model
> > >> (which lends itself to distributed joins and aggregation), and Samza's
> > >> state management features.
> > >>
> > >> One thing to be aware of right now is Samza's "at least once"
> messaging
> > >> guarantee when failures occur (inherited from Kafka). You might
> receive
> > >> duplicate messages. This means you can potentially double count, if
> > you're
> > >> doing aggregation. In the example you give (E1, E2, E3), this
> shouldn┬╣t
> > be
> > >> a problem. We have plans to provide exactly once messaging, but we
> > haven't
> > >> implemented the feature yet.
> > >>
> > >> Cheers,
> > >> Chris
> > >>
> > >> On 8/24/13 12:05 PM, "Alex The Rocker" <alex.m3tal@gmail.com> wrote:
> > >>
> > >>> Hello,
> > >>>
> > >>> I just began to read about Samza, and I very excited about it (I was
> > >>> warned
> > >>> of its existence by Jay Kreps' post in Kafka users list, BTW).
> > >>>
> > >>> My first reaction is: are you guys using it at LinkedIn for
> > applications
> > >>> which lies in the CEP (Complex Event Processing) system domain?
> > >>>
> > >>> To be more specific, would stateful Samza tasks be used in order to
> > >>> compute
> > >>> complex states such as "event E1 is followed by E2 then by E3 with
> less
> > >>> than 10 minutes interval between each event" ?
> > >>>
> > >>> I was looking at Storm for CEP, but as pointed out in Samza Storm
> page,
> > >>> Storm leaves state management to the bolts code, whereas Samza has
> > >>> "something".
> > >>>
> > >>> Beyond state management, what else would make Samza a good building
> > block
> > >>> for a CEP?  Or a bad one?
> > >>>
> > >>> Thanks,
> > >>> Alex.
> > >>
> > >>
> >
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message