kafka-users mailing list archives

From Stephane Maarek <steph...@simplemachines.com.au>
Subject Re: One big kafka connect cluster or many small ones?
Date Fri, 06 Jan 2017 05:50:39 GMT
So I just override the conf while doing the API call? It’d be great to have
this documented somewhere on the confluent website. I couldn’t find it.

On 6 January 2017 at 3:42:45 pm, Ewen Cheslack-Postava (ewen@confluent.io)
wrote:

On Thu, Jan 5, 2017 at 7:19 PM, Stephane Maarek <
stephane@simplemachines.com.au> wrote:

> Thanks a lot for the guidance, I think we’ll go ahead with one cluster. I
> just need to figure out how our CD pipeline can talk to our connect cluster
> securely (because it’ll need direct access to perform API calls).

The documentation isn't great here, but you can apply all the normal
security configs to Connect (in distributed mode, it's basically equivalent
to a consumer, so everything you can do with a consumer you can do with
Connect).
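As a rough sketch, a secured distributed-worker config might look like the
following, assuming SASL_SSL; the hostnames, paths, and passwords are
placeholders, and the `producer.`/`consumer.` prefixed copies cover the
clients Connect embeds for the connectors themselves:

```shell
# Sketch of a secured distributed-worker config (placeholders throughout).
# The unprefixed settings secure the worker's own clients; the producer./
# consumer. prefixed copies secure the clients embedded for connectors.
cat > connect-distributed-secure.properties <<'EOF'
bootstrap.servers=broker1:9093
group.id=connect-cluster
security.protocol=SASL_SSL
ssl.truststore.location=/var/private/ssl/truststore.jks
ssl.truststore.password=changeit
producer.security.protocol=SASL_SSL
producer.ssl.truststore.location=/var/private/ssl/truststore.jks
producer.ssl.truststore.password=changeit
consumer.security.protocol=SASL_SSL
consumer.ssl.truststore.location=/var/private/ssl/truststore.jks
consumer.ssl.truststore.password=changeit
EOF
```

The same pattern applies to whichever mechanism your brokers use (SSL,
SASL/Kerberos, etc.); the property names are the standard Kafka client ones.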

> Lastly, a question or maybe a piece of feedback… is it not possible to
> specify the key serializer and deserializer as part of the rest api job
> config?
> The issue is that sometimes our data is avro, sometimes it’s json. And it
> seems I’d need two separate clusters for that?

This is new! As of a recent release, we support per-connector converter
settings, which allows you to include it in the connector config. It's called
"Converter" in Connect because it does a bit more than ser/des if you've
written them for Kafka, but they are basically just pluggable ser/des. We
knew folks would want this; it just took us a while to find the bandwidth to
implement it. Now, you shouldn't need to do anything special or deploy
multiple clusters -- it's baked in and supported as long as you are willing
to override it on a per-connector basis (and this seems reasonable for most
folks since *ideally* you are *somewhat* standardized on a common
serialization format).
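A sketch of what that per-connector override looks like, using the stock
FileStreamSinkConnector; the connector name, topic, file path, and REST
endpoint here are illustrative:

```shell
# Per-connector converter override: this sink reads JSON even if the
# worker's default converter is, say, Avro. Names and paths are illustrative.
cat > json-sink.json <<'EOF'
{
  "name": "json-file-sink",
  "config": {
    "connector.class": "org.apache.kafka.connect.file.FileStreamSinkConnector",
    "topics": "json-events",
    "file": "/tmp/json-events.txt",
    "key.converter": "org.apache.kafka.connect.json.JsonConverter",
    "value.converter": "org.apache.kafka.connect.json.JsonConverter"
  }
}
EOF
# Submit it to a worker's REST API (assumes a worker on localhost:8083):
# curl -X POST -H "Content-Type: application/json" \
#      --data @json-sink.json http://localhost:8083/connectors
```

Connectors without these keys simply inherit the worker-level defaults, so
Avro and JSON connectors can share one cluster.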


> On 6 January 2017 at 1:54:10 pm, Ewen Cheslack-Postava (ewen@confluent.io)
> wrote:
> On Thu, Jan 5, 2017 at 3:12 PM, Stephane Maarek <
> stephane@simplemachines.com.au> wrote:
> > Hi,
> >
> > We like to operate in micro-services (dockerize and ship everything on
> > ECS) and I was wondering which approach was preferred.
> > We have one kafka cluster, one zookeeper cluster, etc, but when it comes
> to
> > kafka connect I have some doubts.
> >
> > Is it better to have one big Kafka Connect cluster with multiple nodes,
> > or many small Kafka Connect clusters (or standalone workers), one per
> > connector/ETL job?
> >
> You can do any of these, and it may depend on how you do
> orchestration/deployment.
> We built Connect to support running one big cluster with a bunch of
> connectors. It balances work automatically and provides a way to control
> scale up/down via increased parallelism. This means we don't need to make
> any assumptions about how you deploy, how you handle elastically scaling
> your clusters, etc. But if you run in an environment that already has the
> tooling in place to do that, you can also opt to run many smaller clusters
> and use that tooling to scale up/down. In that case you'd just make sure
> each connector had enough tasks so that when you scale up the number of
> workers in a cluster, the rebalancing of work leaves every worker
> occupied.
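For example, setting `tasks.max` high enough that a scaled-up worker count
still has tasks to claim; the connector class and name below are
hypothetical:

```shell
# Hypothetical connector config with headroom in tasks.max, so that adding
# workers gives the rebalance enough tasks to spread around.
cat > my-source-config.json <<'EOF'
{
  "connector.class": "com.example.MySourceConnector",
  "tasks.max": "8"
}
EOF
# Apply via the REST API's per-connector config endpoint:
# curl -X PUT -H "Content-Type: application/json" \
#      --data @my-source-config.json \
#      http://localhost:8083/connectors/my-source/config
```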
> The main drawback of doing this is that Connect uses a few topics for
> configs, status, and offsets, and these need to be unique per cluster.
> This means you'll have 3N more topics. If you're running a *lot* of
> connectors, that could eventually become a problem. It also means you have
> that many more worker configs to handle, clusters to monitor, etc. And
> deploying a connector is no longer as simple as making a call to
> the service's REST API, since there isn't a single centralized service. The
> main benefits I can think of are a) if you already have preferred tooling
> for handling elasticity and b) better resource isolation between connectors
> (i.e. an OOM error in one connector won't affect any other connectors).
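Those per-cluster internal topics are named in each worker config, so
separate clusters need distinct names; something like the following, with
illustrative topic names:

```shell
# Each Connect cluster gets its own internal topics via its worker config;
# running N clusters means N distinct copies of these three settings.
cat > cluster-a-internal-topics.properties <<'EOF'
config.storage.topic=connect-configs-cluster-a
offset.storage.topic=connect-offsets-cluster-a
status.storage.topic=connect-status-cluster-a
EOF
```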
> For standalone mode, we'd generally recommend only using it when
> distributed mode doesn't make sense, e.g. for log file collection. Other
> than that, having the fault tolerance and high availability of distributed
> mode is preferred.
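For reference, standalone mode is just the single-process launcher that
ships with Kafka, pointed at a worker config and one or more connector
configs; the `.properties` files here are the samples from the Kafka
distribution:

```shell
# Wrap the stock standalone launcher in a small script; the .properties
# files are the sample configs shipped with Kafka.
cat > run-standalone.sh <<'EOF'
#!/usr/bin/env bash
exec bin/connect-standalone.sh config/connect-standalone.properties \
    config/connect-file-source.properties
EOF
chmod +x run-standalone.sh
```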
> On your specific points:
> >
> > The issues I’m trying to address are :
> > - Integration with our CI/CD pipeline
> >
> I'm not sure anything about Connect affects this. Is there a specific
> concern you have about the CI/CD pipeline & Connect?
> > - Efficient resources utilisation
> >
> Putting all the connectors into one cluster will probably result in better
> resource utilization unless you're already automatically tracking usage and
> scaling appropriately. The reason is that if you use a bunch of small
> clusters, you're now stuck optimizing N deployments separately. Since Connect can
> already (roughly) balance work, putting all the work into one cluster and
> having connect split it up means you just need to watch utilization of the
> nodes in that one cluster and scale up or down as appropriate.
> > - Easily add new jar files that connectors depend on with minimal
> downtime
> >
> This one is a bit interesting. You shouldn't have any downtime adding jars
> in the sense that you can do rolling bounces of Connect. The one caveat is
> that the current limitation for how it rebalances work involves halting
> work for all connectors/tasks, doing the rebalance, and then starting them
> up again. We plan to improve this, but the timeframe for it is still
> uncertain. Usually these rebalance steps should be pretty quick. The main
> reason this can be a concern is that halting some connectors could take
> some time (e.g. because they need to fully flush their data). This means
> the period of time your connectors are not processing data during one of
> those rebalances is controlled by the "worst" connector.
> I would recommend trying a single cluster but monitoring whether you see
> stalls due to rebalances. If you do, then moving to multiple clusters might
> make sense. (This also, obviously, depends a lot on your SLA for data
> delivery.)
> > - Monitoring operations
> >
> Multiple clusters definitely seems messier and more complicated for this.
> There will be more workers in a single cluster, but it's a single service
> you need to monitor and maintain.
> Hope that helps!
> -Ewen
> >
> > Thanks for your guidance
> >
> > Regards,
> > Stephane
> >
