samza-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jay Kreps <jay.kr...@gmail.com>
Subject Re: Using Samza to build a CEP (Complex Events Processing) system?
Date Wed, 04 Sep 2013 18:19:28 GMT
We are not using zookeeper as a storage system.

The key-value storage layer keeps data in level db per task. Level DB
scales okay up to dozens of GBs, so if you have 50 tasks then maybe on the
order of 50*20GB=~1TB might be reasonable.

The more detailed tradeoffs are the following:
- As the per-container data grows the restore time for a failed container
will grow. You can alleviate this by adding more containers. It is
reasonable to expect to be able to restore at a rate of around 50+MB/sec.
So if you have 10GB in a container it will take around 3 mins to restore.
- As the data per-container grows beyond the capacity of memory the
performance of random reads will decrease from memory access-like times to
seek performance (i.e. us to ms). The performance of writes will be less
effected.
- Indeed there is some housekeeping to do. In particular if you have a
changelog to restore your store from this changelog must be compacted.
Compaction can handle about 50MB/sec per partition, and there is one
partition per task/store combination.

Hope that answers your questions.

-Jay


On Wed, Sep 4, 2013 at 12:06 AM, Alex The Rocker <alex.m3tal@gmail.com>wrote:

> Hi,
>
> Thanks all for answers and sub-topics.
> One sub-topic that puzzles me (still with the goal of building CEP) is how
> Samza key/value store scales.
> I have read couple of scary stories about project using Zookeeper reaching
> some limits, so I think the question is worth.
> How large Samza key/value store can be?
> How its size impacts overall performances?
> Is there some housekeeping to do, for example to clean orphan key/values?
>
> Thanks,
> Alex.
>
>
>
> On Mon, Sep 2, 2013 at 5:20 PM, Jay Kreps <jay.kreps@gmail.com> wrote:
>
> > Hi Tim,
> >
> > What we have in the way of documentation on storage is here:
> >
> >
> http://samza.incubator.apache.org/learn/documentation/0.7.0/container/state-management.html
> >
> > It's a bit sparse. If you help me understand your specific questions I
> will
> > make sure I cover them.
> >
> > -Jay
> >
> >
> > On Sat, Aug 31, 2013 at 1:58 PM, Timothy Chen <tnachen@gmail.com> wrote:
> >
> > > Hi,
> > >
> > > I wonder if there are details about the new key/value store Samza
> > > provides? Especially the design and how it handles scale, consistency
> > > guarantees etc.
> > >
> > > Tim
> > >
> > > On Aug 31, 2013, at 12:56 PM, Alex The Rocker <alex.m3tal@gmail.com>
> > > wrote:
> > >
> > > > Chris,
> > > >
> > > > Thanks you very much for your detailed.
> > > > Another system for processing real-time data just came to my
> attention
> > > > (thanks to Kafka mailing list, again).
> > > > It's called Druid (more at: http://druid.io).
> > > >
> > > > While I now understand Samza advantages over Storm for building a
> CEP,
> > I
> > > am
> > > > wondering how Samza compares to Druid.
> > > > I guess I may not alone wondering about Samza vs. Druid, so you may
> > want
> > > to
> > > > add a Samza vs. Druid" item in Samza documenation :)
> > > >
> > > > Thanks,
> > > > Alex.
> > > >
> > > >
> > > >
> > > >
> > > > On Sun, Aug 25, 2013 at 5:26 PM, Chris Riccomini <
> > > criccomini@linkedin.com>wrote:
> > > >
> > > >> Hey Alex,
> > > >>
> > > >> As I understand it, the CEP pattern you describing is, "look for a
> > > series
> > > >> of events within some bounded time frame, and take an action based
> on
> > > the
> > > >> combination of events." You use an example of three events arriving
> > > within
> > > >> 10 minutes of each other, consecutively. Wikipedia uses a similar
> > > example
> > > >> (wedding bell event + man in suit event + woman in white dress
> event +
> > > >> rice thrown event = wedding) on their CEP page.
> > > >>
> > > >> This pattern can be implemented in Samza fairly easily using Samza's
> > > >> key/value store (or some other StorageEngine, if you choose to
> > implement
> > > >> it). It's best to use a key/value store for this use case, since the
> > > >> window might be quite long (10 minutes), and all events in the
> window
> > > >> might not fit in memory. If you use Samza's key/value store, you can
> > put
> > > >> each message (and a timestamp) into the key/value store as the
> > messages
> > > >> arrive. You can then implement the WindowableTask interface along
> with
> > > the
> > > >> StreamTask interface, and configure Samza to call window() on your
> > task
> > > >> every N seconds (say, task.window.ms=60000). The window method
> could
> > > then
> > > >> do a range query on the key/value store, and check for message
> chains
> > > >> (e.g. E1 -> E2 -> E3) that were last updated > 10 minutes
ago. If an
> > > >> expected message was missing, you could then take some action (send
> an
> > > >> alert, or whatever).
> > > >>
> > > >> In general, when I think CEP, I think Esper (
> > http://esper.codehaus.org/
> > > ).
> > > >> You should be able to implement a lot of CEP/SQL type commands
> > (SELECT,
> > > >> JOIN, COUNT, SUM, DISTINCT, WHERE, GROUP BY, HAVING, WINDOW, ORDER,
> > etc)
> > > >> using Samza's StreamTask interface, and is state management
> > facilities.
> > > >>
> > > >> Beyond state management, most features in Samza enable CEP
> processing,
> > > in
> > > >> one way or another. From your perspective, you can look at Samza as
> > the
> > > >> underlying framework with which you might choose to implement a CEP
> > type
> > > >> system (think MapReduce is to Hive as Samza is to a CEP system).
> > > Specific
> > > >> things that help are its WindowableTask interface, the partitioning
> > > model
> > > >> (which lends itself to distributed joins and aggregation), and
> Samza's
> > > >> state management features.
> > > >>
> > > >> One thing to be aware of right now is Samza's "at least once"
> > messaging
> > > >> guarantee when failures occur (inherited from Kafka). You might
> > receive
> > > >> duplicate messages. This means you can potentially double count, if
> > > you're
> > > >> doing aggregation. In the example you give (E1, E2, E3), this
> > shouldn┬╣t
> > > be
> > > >> a problem. We have plans to provide exactly once messaging, but we
> > > haven't
> > > >> implemented the feature yet.
> > > >>
> > > >> Cheers,
> > > >> Chris
> > > >>
> > > >> On 8/24/13 12:05 PM, "Alex The Rocker" <alex.m3tal@gmail.com>
> wrote:
> > > >>
> > > >>> Hello,
> > > >>>
> > > >>> I just began to read about Samza, and I very excited about it
(I
> was
> > > >>> warned
> > > >>> of its existence by Jay Kreps' post in Kafka users list, BTW).
> > > >>>
> > > >>> My first reaction is: are you guys using it at LinkedIn for
> > > applications
> > > >>> which lies in the CEP (Complex Event Processing) system domain?
> > > >>>
> > > >>> To be more specific, would stateful Samza tasks be used in order
to
> > > >>> compute
> > > >>> complex states such as "event E1 is followed by E2 then by E3
with
> > less
> > > >>> than 10 minutes interval between each event" ?
> > > >>>
> > > >>> I was looking at Storm for CEP, but as pointed out in Samza Storm
> > page,
> > > >>> Storm leaves state management to the bolts code, whereas Samza
has
> > > >>> "something".
> > > >>>
> > > >>> Beyond state management, what else would make Samza a good building
> > > block
> > > >>> for a CEP?  Or a bad one?
> > > >>>
> > > >>> Thanks,
> > > >>> Alex.
> > > >>
> > > >>
> > >
> >
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message