samza-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Yi Pan <nickpa...@gmail.com>
Subject Re: Samza and KStreams (KIP-28): LinkedIn's POV
Date Mon, 05 Oct 2015 17:26:38 GMT
Hi, Renato,

Thanks a lot! Please feel free to distribute this message if you
encountered this question again. :)

Looking forward to your Kinesis and ActiveMQ patches!

Cheers!

-Yi

On Sun, Oct 4, 2015 at 9:45 PM, Renato Marroquín Mogrovejo <
renatoj.marroquin@gmail.com> wrote:

> Hi Yi,
>
> Thanks a lot for sharing this POV, makes a lot more sense to me now what
> both (KStreams and Samza) are targeting to.
>
>
> Best,
>
> Renato M.
>
> 2015-10-03 20:13 GMT+02:00 Roger Hoover <roger.hoover@gmail.com>:
>
> > Hi Yi,
> >
> > Thank you for sharing this update and perspective.  I tend to agree that
> > for simple, stateless cases, things could be easier and hopefully
> KStreams
> > may help with that.  I also appreciate a lot of features that Samza
> already
> > supports for operations.  As previously discussed, the biggest request I
> > have is being able to run Samza without YARN, under something like
> > Kubernetes instead.
> >
> > Also, I'm curious.  What's the current state of the Samza SQL physical
> > operators?  Are they used in production yet?  Is there a doc on how to
> use
> > them?
> >
> > Thanks,
> >
> > Roger
> >
> >
> >
> > On Fri, Oct 2, 2015 at 1:54 PM, Yi Pan <nickpan47@gmail.com> wrote:
> >
> > > Hi, all Samza-lovers,
> > >
> > > This question on the relationship of Kafka KStream (KIP-28) and Samza
> has
> > > come up a couple times recently. So we wanted to clarify where we stand
> > at
> > > LinkedIn in terms of this discussion.
> > >
> > > Samza has historically had a symbiotic relationship with Kafka and will
> > > continue to work very well with Kafka.  Earlier in the year, we had an
> > > in-depth discussion exploring an even deeper integration with Kafka.
> > After
> > > hitting multiple practical issues (e.g. Apache rules) and technical
> > issues
> > > we had to give up on that idea.  As a fall out of the discussion, the
> > Kafka
> > > community is adding some of the basic event processing capabilities
> into
> > > Kafka core directly. The basic callback/push style programming model by
> > > itself is a great addition to the Kafka API set.
> > >
> > > However at LinkedIn, we continue to be firmly committed to Samza as our
> > > stream processing framework. Although KStream is a nice addition to
> Kafka
> > > stack, our goals for Samza are broader. There are some key technical
> > > differences that makes Samza the right strategy for us.
> > >
> > > 1.  Support for non-kafka systems :
> > >
> > > At LinkedIn a larger percentage of our streaming jobs use Databus as an
> > > input source.   For any such non-Kafka source, although the CopyCat
> > > connector framework gives a common model for pulling data out of a
> source
> > > and pushing it into Kafka, it introduces yet another piece of
> > > infrastructure that we have to operate and manage.  Also for any
> > companies
> > > who are already on AWS, Google Compute, Azure etc.  asking them to
> deploy
> > > and operate kafka in AWS instead of using the natively supported
> services
> > > like Kinesis, Google Cloud pub-sub, etc. etc. can potentially be a
> > > non-starter.  With more acquisitions at LinkedIn that use AWS we are
> also
> > > facing this first hand.  The Samza community has a healthy set of
> system
> > > producers/consumers which are in the works (Kinesis, ActiveMQ,
> > > ElasticSearch, HDFS, etc.).
> > >
> > > 2. We run Samza as a Stream Processing Service at LinkedIn. This is
> > > fundamentally different from KStream.
> > >
> > > This is similar to AWS Lambda and Google Cloud Dataflow, Azure Stream
> > > Insight and similar services.  The service makes it much easier for
> > > developers to get their stream processing jobs up and running in
> > production
> > > by helping with the most common problems like monitoring, dashboards,
> > > auto-scale, rolling upgrades and such.
> > >
> > > The truth is that if the stream processing application is stateless
> then
> > > some of these common problems are not as involved and can be solved
> even
> > on
> > > regular IaaS platforms like EC2 and such.   Arguably stateless
> > applications
> > > can be built easily on top of the native APIs from the input source
> like
> > > kafka, kinesis etc.
> > >
> > > The place where Samza shines is with stateful stream processing
> > > applications.  When a Samza application uses the local rocks DB based
> > > state, the application needs special care in terms of rolling upgrades,
> > > addition/removal of containers/machines, temporary machine failures,
> > > capacity management.  We have already done some really good work in
> Samza
> > > 0.10 to make sure that we don't reseed the state from kafka (i.e.
> > > host-affinity feature that allows to reuse the local states).  In the
> > > absence of this feature, we had multiple production issues caused due
> to
> > > network saturation during state reseeding from kafka.   The problems
> with
> > > stateful applications are similar to problems encountered when building
> > > no-sql databases and other data systems.
> > >
> > > There are surely some scenarios where customers don't want any YARN
> > > dependency and want to run their stream processing application on a
> > > dedicated set of nodes.  This is where KStream clearly has an advantage
> > > over current Samza. Prior to KStream we had a patch in Samza which also
> > > solved the same problem (SAMZA-516). We do expect to finish this patch
> > soon
> > > and formally support Stand Alone Samza.
> > >
> > > 3. Operators for Stream Processing and SQL :
> > >
> > > At LinkedIn, there is a growing need to iterate Samza jobs faster in
> the
> > > loop of implement, deploy, revise the code, and deploy again. A key
> > > bottleneck that slows down this iteration is the implementation of a
> > Samza
> > > job. It has long-been recognized in the Samza community that there is a
> > > strong need for a high-level language support to shorten this iterative
> > > process. Since last year, we have identified SQL as the user-facing
> > > high-level language and completed the high-level design and started
> > > prototyping it in Samza. The prototype starts with a set of physical
> > > operators which are crucial to the correctness of streaming processing,
> > > namely, the window operator, aggregation, and join. KStream adopts some
> > of
> > > these core ideas. However, our view in Samza’s SQL support goes beyond
> > > what’s covered in KStream. We want Samza’s SQL support to be as easy as
> > > Google Dataflow and Azure Stream Analytics, in which a user can upload
> a
> > > query statement and the system will parse the query, translate it into
> a
> > > distributed execution plan, allocate the containers and stream
> resources
> > in
> > > a cluster, and deploy it. To support this grand vision, our effort in
> > > building the SQL operators API, the query planner and optimizers is
> > vastly
> > > different from what KStream covers, which only covers a single node
> > > programming interface.
> > >
> > > Independent of these strategic differences, one big aspect for us is
> also
> > > the fact that Samza is an established and mature system which we have
> > > successfully operationalized and has been running in production for a
> few
> > > years.
> > >
> > >
> > >
> > > Thanks!
> > >
> > > -Yi Pan
> > > Samza @ LinkedIn
> > >
> >
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message