samza-dev mailing list archives

From Roger Hoover <roger.hoo...@gmail.com>
Subject Re: Samza and KStreams (KIP-28): LinkedIn's POV
Date Sat, 03 Oct 2015 18:13:44 GMT
Hi Yi,

Thank you for sharing this update and perspective.  I tend to agree that
for simple, stateless cases, things could be easier and hopefully KStreams
may help with that.  I also appreciate a lot of features that Samza already
supports for operations.  As previously discussed, the biggest request I
have is being able to run Samza without YARN, under something like
Kubernetes instead.

Also, I'm curious.  What's the current state of the Samza SQL physical
operators?  Are they used in production yet?  Is there a doc on how to use
them?

Thanks,

Roger



On Fri, Oct 2, 2015 at 1:54 PM, Yi Pan <nickpan47@gmail.com> wrote:

> Hi, all Samza-lovers,
>
> This question on the relationship of Kafka KStream (KIP-28) and Samza has
> come up a couple times recently. So we wanted to clarify where we stand at
> LinkedIn in terms of this discussion.
>
> Samza has historically had a symbiotic relationship with Kafka and will
> continue to work very well with Kafka.  Earlier in the year, we had an
> in-depth discussion exploring an even deeper integration with Kafka.  After
> hitting multiple practical issues (e.g. Apache rules) and technical issues,
> we had to give up on that idea.  As a fallout of that discussion, the Kafka
> community is adding some of the basic event processing capabilities into
> Kafka core directly. The basic callback/push style programming model by
> itself is a great addition to the Kafka API set.
>
> However, at LinkedIn we continue to be firmly committed to Samza as our
> stream processing framework.  Although KStream is a nice addition to the
> Kafka stack, our goals for Samza are broader.  There are some key technical
> differences that make Samza the right strategy for us.
>
> 1. Support for non-Kafka systems:
>
> At LinkedIn a large percentage of our streaming jobs use Databus as an
> input source.  For any such non-Kafka source, although the CopyCat
> connector framework gives a common model for pulling data out of a source
> and pushing it into Kafka, it introduces yet another piece of
> infrastructure that we have to operate and manage.  Also, for companies
> that are already on AWS, Google Compute, Azure, etc., asking them to deploy
> and operate Kafka there instead of using natively supported services such
> as Kinesis or Google Cloud Pub/Sub can be a non-starter.  With more
> LinkedIn acquisitions that use AWS, we are also facing this firsthand.  The
> Samza community has a healthy set of system producers/consumers in the
> works (Kinesis, ActiveMQ, Elasticsearch, HDFS, etc.).
>
> 2. We run Samza as a Stream Processing Service at LinkedIn. This is
> fundamentally different from KStream.
>
> This is similar to AWS Lambda, Google Cloud Dataflow, Azure Stream
> Analytics, and similar services.  The service makes it much easier for
> developers to get their stream processing jobs up and running in production
> by helping with the most common problems, such as monitoring, dashboards,
> auto-scaling, and rolling upgrades.
>
> The truth is that if the stream processing application is stateless, then
> some of these common problems are not as involved and can be solved even on
> regular IaaS platforms such as EC2.  Arguably, stateless applications can
> be built easily on top of the native APIs of the input source, such as
> Kafka or Kinesis.
>
> The place where Samza shines is with stateful stream processing
> applications.  When a Samza application uses local RocksDB-based
> state, the application needs special care in terms of rolling upgrades,
> addition/removal of containers/machines, temporary machine failures, and
> capacity management.  We have already done some really good work in Samza
> 0.10 to make sure that we don't reseed the state from Kafka (i.e. the
> host-affinity feature, which allows reuse of local state).  In the
> absence of this feature, we had multiple production issues caused by
> network saturation during state reseeding from Kafka.  The problems with
> stateful applications are similar to problems encountered when building
> NoSQL databases and other data systems.
>
> There are surely some scenarios where customers don't want any YARN
> dependency and want to run their stream processing application on a
> dedicated set of nodes.  This is where KStream clearly has an advantage
> over current Samza.  Prior to KStream, we had a patch in Samza that solved
> the same problem (SAMZA-516).  We do expect to finish this patch soon
> and formally support Stand Alone Samza.
>
> 3. Operators for Stream Processing and SQL:
>
> At LinkedIn, there is a growing need to iterate on Samza jobs faster in the
> loop of implementing, deploying, revising the code, and deploying again.  A
> key bottleneck that slows down this iteration is the implementation of a
> Samza job.  It has long been recognized in the Samza community that there
> is a strong need for high-level language support to shorten this iterative
> process.  Since last year, we have identified SQL as the user-facing
> high-level language, completed the high-level design, and started
> prototyping it in Samza.  The prototype starts with a set of physical
> operators that are crucial to the correctness of stream processing,
> namely the window, aggregation, and join operators.  KStream adopts some of
> these core ideas.  However, our vision for Samza’s SQL support goes beyond
> what’s covered in KStream.  We want Samza’s SQL support to be as easy to
> use as Google Dataflow and Azure Stream Analytics, in which a user can
> upload a query statement and the system will parse the query, translate it
> into a distributed execution plan, allocate the containers and stream
> resources in a cluster, and deploy it.  To support this grand vision, our
> effort in building the SQL operators API, the query planner, and the
> optimizers is vastly different from what KStream covers, which is only a
> single-node programming interface.
>
> Independent of these strategic differences, one big aspect for us is also
> the fact that Samza is an established and mature system that we have
> successfully operationalized and that has been running in production for a
> few years.
>
>
>
> Thanks!
>
> -Yi Pan
> Samza @ LinkedIn
>
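
For readers unfamiliar with the window and aggregation operators mentioned
in point 3 of the quoted message, here is a minimal sketch of a
tumbling-window count in plain Python.  It is not Samza's or KStream's API;
the function name, the event layout (timestamp in milliseconds plus a key),
and the sample data are all assumptions made for illustration.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_ms):
    """Count events per (window_start, key) over fixed, non-overlapping windows.

    `events` is an iterable of (timestamp_ms, key) pairs (hypothetical layout).
    """
    counts = defaultdict(int)
    for timestamp_ms, key in events:
        # Align the timestamp down to the start of its window.
        window_start = (timestamp_ms // window_ms) * window_ms
        counts[(window_start, key)] += 1
    return dict(counts)


# Hypothetical page-view events: (timestamp in ms, page key).
events = [
    (1_000, "page_a"),
    (1_500, "page_a"),
    (2_200, "page_b"),
    (61_000, "page_a"),
]
result = tumbling_window_counts(events, window_ms=60_000)
```

A real streaming operator would keep these counts in a local state store and
emit results as windows close, rather than scanning a finished list; that
local state is what makes the rolling-upgrade and reseeding concerns in
point 2 relevant.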
