samza-dev mailing list archives

From Jordan Shaw <>
Subject Re: How do you serve the data computed by Samza?
Date Fri, 27 Mar 2015 19:42:02 GMT
Here are my thoughts:

1 - 2) I think a majority of Samza applications are internal so far.
However, I've developed a Samza publisher for PubNub that allows you to
send data from process() or window() out over a Data Stream Network. Right
now it looks something like this:

(.send collector
       (OutgoingMessageEnvelope. (SystemStream. "pubnub" "some-channel")
                                 {:pub_key demo :sub_key demo}
                                 some-data))

At smaller scale you could do the same with etc... If you're
interested in this I can send you the src or jar. If there is wider
interest I can open source it on GitHub, but it needs some cleanup first.
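For reference, the equivalent send in plain Java would look roughly like the sketch below. The SystemStream and OutgoingMessageEnvelope classes here are simplified stand-ins for Samza's own (just so the example compiles on its own), and the collector is mocked as an in-memory list:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Simplified stand-ins for org.apache.samza.system classes; a real
// task would import the Samza API instead of defining these.
class SystemStream {
    final String system, stream;
    SystemStream(String system, String stream) {
        this.system = system;
        this.stream = stream;
    }
}

class OutgoingMessageEnvelope {
    final SystemStream systemStream;
    final Object key, message;
    OutgoingMessageEnvelope(SystemStream systemStream, Object key, Object message) {
        this.systemStream = systemStream;
        this.key = key;
        this.message = message;
    }
}

public class PubNubSendSketch {
    // Mock collector that just records what was sent.
    static final List<OutgoingMessageEnvelope> sent = new ArrayList<>();

    static void send(OutgoingMessageEnvelope envelope) {
        sent.add(envelope);
    }

    public static void main(String[] args) {
        // Route a message to the "some-channel" stream of a
        // hypothetical "pubnub" system, keyed by the PubNub keys.
        Map<String, String> keys = Map.of("pub_key", "demo", "sub_key", "demo");
        send(new OutgoingMessageEnvelope(
                new SystemStream("pubnub", "some-channel"), keys, "some-data"));
        System.out.println(sent.size()); // 1
    }
}
```

In a real job the send goes through the MessageCollector passed into process() or window(), and the system name maps to a SystemProducer configured in the job properties.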

3) We currently don't have a need to warehouse our stream, but we have
thought about piping Samza-generated data into some Hadoop-based system for
longer-term analysis, then running Hive queries over that data.

4) I can't comment on the throughput of the other systems (HBase etc.), but
our Kafka/Samza throughput is pretty impressive considering the
single-threaded nature of the system. We are seeing raw throughput per
partition of well over 10 MB/s.

5) I haven't run into this. To prevent data loss or backup when we can't
process a message, we have considered dropping it into an "unprocessed"
topic, but we haven't really run into this need. If you needed to reprocess
all raw data it would be pretty straightforward; you could just add a
partition to support the extra load.
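The "unprocessed" topic idea is essentially a dead-letter topic. A minimal, self-contained sketch of the pattern in plain Java (the topic names, the Integer.parseInt processing step, and the in-memory send are hypothetical stand-ins; a real Samza task would emit an OutgoingMessageEnvelope through its MessageCollector):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DeadLetterSketch {
    // In-memory stand-in for Kafka topics, keyed by topic name.
    static final Map<String, List<String>> topics = new HashMap<>();

    static void send(String topic, String message) {
        topics.computeIfAbsent(topic, t -> new ArrayList<>()).add(message);
    }

    // Try to process a raw message; on failure, divert it to an
    // "unprocessed" topic instead of dropping it, so a separate job
    // can replay or inspect it later.
    static void process(String raw) {
        try {
            int value = Integer.parseInt(raw); // hypothetical processing step
            send("processed", "value=" + value);
        } catch (NumberFormatException e) {
            send("unprocessed", raw);          // dead-letter the bad input
        }
    }

    public static void main(String[] args) {
        process("42");
        process("not-a-number");
        System.out.println(topics.get("processed"));   // [value=42]
        System.out.println(topics.get("unprocessed")); // [not-a-number]
    }
}
```

The nice property is that the main job never blocks or crashes on a bad record, and the unprocessed topic doubles as a backlog you can drain at your own pace.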

6) Kafka is pretty good at ingesting things, so could you elaborate more on

On Fri, Mar 27, 2015 at 9:52 AM, Felix GV <> wrote:

> Hi Samza devs, users and enthusiasts,
> I've kept an eye on the Samza project for a while and I think it's super
> cool! I hope it continues to mature and expand as it seems very promising (:
> One thing I've been wondering for a while is: how do people serve the data
> they computed on Samza? More specifically:
>   1.  How do you expose the output of Samza jobs to online applications
> that need low-latency reads?
>   2.  Are these online apps mostly internal (i.e.: analytics, dashboards,
> etc.) or public/user-facing?
>   3.  What systems do you currently use (or plan to use in the short-term)
> to host the data generated in Samza? HBase? Cassandra? MySQL? Druid? Others?
>   4.  Are you satisfied or are you facing challenges in terms of the write
> throughput supported by these storage/serving systems? What about read
> throughput?
>   5.  Are there situations where you wish to re-process all historical
> data when making improvements to your Samza job, which results in the need
> to re-ingest all of the Samza output into your online serving system (as
> described in the Kappa Architecture)? Is this easy breezy or painful? Do
> you need to throttle it lest your serving system fall over?
>   6.  If there was a highly-optimized and reliable way of ingesting
> partitioned streams quickly into your online serving system, would that
> help you leverage Samza more effectively?
> Your insights would be much appreciated!
> Thanks (:
> --
> Felix

Jordan Shaw
Full Stack Software Engineer
PubNub Inc
1045 17th St
San Francisco, CA 94107
