samza-dev mailing list archives

From Felix GV <>
Subject RE: How do you serve the data computed by Samza?
Date Sat, 28 Mar 2015 02:05:06 GMT
Hi Jordan,

Thanks for your response (:

I think I might not have done a good job at explaining the use case I'm interested in. The
use case is the following:

Samza does some computation (joins, counts, pattern detection, whatever the case may be...)
and the output of that computation needs to be served back online. In other words, there is
an application (web server or otherwise) that needs to query a data store to get some data
that was computed in nearline by Samza.
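(To make the pattern concrete, here is a toy sketch, not Samza's actual API: the job's output is treated as a changelog of (key, value) records, which get materialized into a key-value view that the online application reads from. A `None` value is treated as a delete, mirroring the tombstone convention of a log-compacted Kafka topic. The plain dict stands in for whatever serving store you'd actually use.)

```python
# Toy sketch (hypothetical, not Samza's API): materialize a stream of
# changelog records into a key-value view for low-latency online reads.
# Records are (key, value) pairs; value=None is a delete (tombstone),
# as in a log-compacted Kafka topic.

def materialize(records, store):
    """Apply a stream of (key, value) changelog records to a store."""
    for key, value in records:
        if value is None:
            store.pop(key, None)   # tombstone: remove the key
        else:
            store[key] = value     # upsert: latest value wins
    return store

# The online app then serves reads straight from `store`,
# e.g. store.get("user-42"), instead of querying Samza itself.
```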

I'm interested in knowing which data stores people use for serving data that comes from Kafka.

I know that Kafka and Samza can churn out data at an insane rate (: ... That is the reason
why I am concerned about whether the data stores people use out there are capable of ingesting
that kind of write throughput effectively, or if they risk falling over or otherwise becoming
too slow to serve the read requests they are meant to serve.

What I meant with the re-processing part is that it seems likely that a data store would be
capable of ingesting the real-time throughput of a given Samza job, but that it may fall over
if that Samza job is scaled out in order to quickly re-process all historical data. The reason
why people would want to do that (re-process all historical data) is for any code change to
the Samza processor, be it bug fixes, new streams getting joined to the data, new features
getting extracted, etc.
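(One mitigation I have in mind for that re-processing scenario is throttling the bulk re-ingestion on the writer side, for example with a token bucket, so the serving store's write path is never pushed past a chosen ceiling. A minimal sketch follows; `rate` and `capacity` are illustrative knobs, not Samza or Kafka settings.)

```python
# Hedged sketch: rate-limit bulk re-ingestion with a token bucket so a
# scaled-out re-processing job doesn't overwhelm the serving store.
import time

class TokenBucket:
    def __init__(self, rate, capacity):
        self.rate = rate            # tokens (writes) replenished per second
        self.capacity = capacity    # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def acquire(self, n=1):
        """Block until n write tokens are available, then consume them."""
        while True:
            now = time.monotonic()
            # Refill proportionally to elapsed time, capped at capacity.
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= n:
                self.tokens -= n
                return
            time.sleep((n - self.tokens) / self.rate)

# Hypothetical usage, capping writes at ~5000/s:
# bucket = TokenBucket(rate=5000, capacity=1000)
# for record in historical_output:
#     bucket.acquire()
#     serving_store.write(record)
```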

I hope that clarifies what I meant. I'd love to hear about other people's experience if you
are willing to share (: !


Felix GV
Data Infrastructure Engineer
Distributed Data Systems

From: Jordan Shaw []
Sent: Friday, March 27, 2015 12:42 PM
Subject: Re: How do you serve the data computed by Samza?

Here are my thoughts below

1 - 2) I think a majority of Samza applications are internal so far.
However I've developed a Samza Publisher for PubNub that would allow you to
send data from process or window out over a Data Stream Network. Right now
it looks something like this:

(.send collector (OutgoingMessageEnvelope. (SystemStream. "pubnub.some-channel")
                 {:pub_key demo :sub_key demo} some-data))

At smaller scale you could do the same with etc... If you're
interested in this I can send you the src or jar. If there is wider
interest I can open source it on GitHub, but it needs some cleanup first.

3) We currently don't have the need to warehouse our stream, but we have
thought about piping Samza-generated data into some Hadoop-based system for
longer-term analysis, and then running Hive queries (or similar) over that data.

4) I can't comment on the throughput of the other systems (HBase etc.), but
our Kafka/Samza throughput is pretty impressive considering the
single-threaded nature of the system. We are seeing raw throughput well
over 10 MB/s per partition.

5) I haven't run into this. To prevent data loss/backup when we can't
process a message, we have considered dropping it into an "unprocessed"
topic, but we haven't really run into that need. If you needed to reprocess
all raw data it would be pretty straightforward; you could just add a
partition to support the extra load.

6) Kafka is pretty good at ingesting things so could you elaborate more on

On Fri, Mar 27, 2015 at 9:52 AM, Felix GV <> wrote:

> Hi Samza devs, users and enthusiasts,
> I've kept an eye on the Samza project for a while and I think it's super
> cool! I hope it continues to mature and expand as it seems very promising (:
> One thing I've been wondering for a while is: how do people serve the data
> they computed on Samza? More specifically:
>   1.  How do you expose the output of Samza jobs to online applications
> that need low-latency reads?
>   2.  Are these online apps mostly internal (i.e.: analytics, dashboards,
> etc.) or public/user-facing?
>   3.  What systems do you currently use (or plan to use in the short-term)
> to host the data generated in Samza? HBase? Cassandra? MySQL? Druid? Others?
>   4.  Are you satisfied or are you facing challenges in terms of the write
> throughput supported by these storage/serving systems? What about read
> throughput?
>   5.  Are there situations where you wish to re-process all historical
> data when making improvements to your Samza job, which results in the need
> to re-ingest all of the Samza output into your online serving system (as
> described in the Kappa Architecture)? Is this easy breezy or painful? Do
> you need to throttle it lest your serving system fall over?
>   6.  If there was a highly-optimized and reliable way of ingesting
> partitioned streams quickly into your online serving system, would that
> help you leverage Samza more effectively?
> Your insights would be much appreciated!
> Thanks (:
> --
> Felix

Jordan Shaw
Full Stack Software Engineer
PubNub Inc
1045 17th St
San Francisco, CA 94107
