kafka-users mailing list archives

From Ewen Cheslack-Postava <e...@confluent.io>
Subject Re: One big kafka connect cluster or many small ones?
Date Fri, 06 Jan 2017 04:42:44 GMT
On Thu, Jan 5, 2017 at 7:19 PM, Stephane Maarek <
stephane@simplemachines.com.au> wrote:

> Thanks a lot for the guidance, I think we’ll go ahead with one cluster. I
> just need to figure out how our CD pipeline can talk to our connect cluster
> securely (because it’ll need direct access to perform API calls).
>

The documentation isn't great here, but you can apply all the normal
security configs to Connect (in distributed mode, it's basically equivalent
to a consumer, so everything you can do with a consumer you can do with
Connect).
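
To illustrate, a distributed worker config can carry the usual client security settings (a sketch; the SASL_SSL setup, hostnames, and file paths here are hypothetical):

```properties
# Distributed worker config -- illustrative values only
bootstrap.servers=broker1:9093
group.id=connect-cluster
security.protocol=SASL_SSL
sasl.mechanism=PLAIN
ssl.truststore.location=/etc/kafka/truststore.jks
ssl.truststore.password=changeit

# The same settings can be repeated with prefixes to apply to the
# clients Connect creates on behalf of connectors
producer.security.protocol=SASL_SSL
consumer.security.protocol=SASL_SSL
```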


>
> Lastly, a question or maybe a piece of feedback… is it not possible to
> specify the key serializer and deserializer as part of the rest api job
> config?
> The issue is that sometimes our data is avro, sometimes it’s json. And it
> seems I’d need two separate clusters for that?
>

This is new! As of 0.10.1.0, we have
https://cwiki.apache.org/confluence/display/KAFKA/KIP-75+-+Add+per-connector+Converters
which allows you to include it in the connector config. It's called
"Converter" in Connect because it does a bit more than ser/des if you've
written them for Kafka, but they are basically just pluggable ser/des. We
knew folks would want this; it just took us a while to find the bandwidth to
implement it. Now, you shouldn't need to do anything special or deploy
multiple clusters -- it's baked in and supported as long as you are willing
to override it on a per-connector basis (and this seems reasonable for most
folks since *ideally* you are *somewhat* standardized on a common
serialization format).
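
For example, the converters can be overridden right in the connector config you POST to the REST API (a sketch; the connector name, topic, and file path are made up, and the JsonConverter override assumes your worker defaults to something else, e.g. Avro):

```json
{
  "name": "json-events-sink",
  "config": {
    "connector.class": "org.apache.kafka.connect.file.FileStreamSinkConnector",
    "topics": "events",
    "file": "/tmp/events.out",
    "key.converter": "org.apache.kafka.connect.json.JsonConverter",
    "key.converter.schemas.enable": "false",
    "value.converter": "org.apache.kafka.connect.json.JsonConverter",
    "value.converter.schemas.enable": "false"
  }
}
```

Connectors that omit these keys simply fall back to the worker's defaults, so Avro and JSON connectors can share one cluster.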

-Ewen


>
> On 6 January 2017 at 1:54:10 pm, Ewen Cheslack-Postava (ewen@confluent.io)
> wrote:
>
> On Thu, Jan 5, 2017 at 3:12 PM, Stephane Maarek <
> stephane@simplemachines.com.au> wrote:
>
> > Hi,
> >
> > We like to operate in micro-services (dockerize and ship everything on
> ecs)
> > and I was wondering which approach was preferred.
> > We have one kafka cluster, one zookeeper cluster, etc, but when it comes
> to
> > kafka connect I have some doubts.
> >
> > Is it better to have one big kafka connect with multiple nodes, or many
> > small kafka connect clusters or standalone, for each connector / etl ?
> >
>
> You can do any of these, and it may depend on how you do
> orchestration/deployment.
>
> We built Connect to support running one big cluster running a bunch of
> connectors. It balances work automatically and provides a way to control
> scale up/down via increased parallelism. This means we don't need to make
> any assumptions about how you deploy, how you handle elastically scaling
> your clusters, etc. But if you run in an environment and have the tooling
> in place to do that already, you can also opt to run many smaller clusters
> and use that tooling to scale up/down. In that case you'd just make sure
> there were enough tasks for each connector so that when you scale up the #
> of workers for a cluster, the rebalancing of work would keep every worker
> occupied.
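>
> As a sketch, that parallelism knob is the standard tasks.max setting in
> each connector's config (the connector name and class here are made up):
>
> ```json
> {
>   "name": "example-source",
>   "config": {
>     "connector.class": "com.example.ExampleSourceConnector",
>     "tasks.max": "8"
>   }
> }
> ```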
>
> The main drawback of doing this is that Connect uses a few topics for
> configs, status, and offsets and you need these to be unique per cluster.
> This means you'll have 3N more topics. If you're running a *lot* of
> connectors, that could eventually become a problem. It also means you have
> that many more worker configs to handle, clusters to monitor, etc. And
> deploying a connector is no longer as simple as just making a call to the
> service's REST API, since there isn't a single centralized service. The
> main benefits I can think of are a) if you already have preferred tooling
> for handling elasticity and b) better resource isolation between
> connectors
> (i.e. an OOM error in one connector won't affect any other connectors).
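>
> Concretely, each cluster's worker config needs its own copies of those
> three topics (a sketch with made-up names):
>
> ```properties
> # Workers in cluster A -- each additional cluster repeats this with
> # its own group.id and topic names
> group.id=connect-cluster-a
> config.storage.topic=connect-a-configs
> offset.storage.topic=connect-a-offsets
> status.storage.topic=connect-a-status
> ```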
>
> For standalone mode, we'd generally recommend only using it when
> distributed mode doesn't make sense, e.g. for log file collection. Other
> than that, having the fault tolerance and high availability of distributed
> mode is preferred.
>
> On your specific points:
>
> >
> > The issues I’m trying to address are :
> > - Integration with our CI/CD pipeline
> >
>
> I'm not sure anything about Connect affects this. Is there a specific
> concern you have about the CI/CD pipeline & Connect?
>
>
> > - Efficient resources utilisation
> >
>
> Putting all the connectors into one cluster will probably result in better
> resource utilization unless you're already automatically tracking usage
> and
> scaling appropriately. The reason is that if you use a bunch of small
> clusters, you're now stuck optimizing utilization across N separate
> clusters. Since Connect can
> already (roughly) balance work, putting all the work into one cluster and
> having connect split it up means you just need to watch utilization of the
> nodes in that one cluster and scale up or down as appropriate.
>
>
> > - Easily add new jar files that connectors depend on with minimal
> downtime
> >
>
> This one is a bit interesting. You shouldn't have any downtime adding jars
> in the sense that you can do rolling bounces of Connect. The one caveat is
> that the current limitation for how it rebalances work involves halting
> work for all connectors/tasks, doing the rebalance, and then starting them
> up again. We plan to improve this, but the timeframe for it is still
> uncertain. Usually these rebalance steps should be pretty quick. The main
> reason this can be a concern is that halting some connectors could take
> some time (e.g. because they need to fully flush their data). This means
> the period of time your connectors are not processing data during one of
> those rebalances is controlled by the "worst" connector.
>
> I would recommend trying a single cluster but monitoring whether you see
> stalls due to rebalances. If you do, then moving to multiple clusters
> might
> make sense. (This also, obviously, depends a lot on your SLA for data
> delivery.)
>
>
> > - Monitoring operations
> >
>
> Multiple clusters definitely seems messier and more complicated for this.
> There will be more workers in a single cluster, but it's a single service
> you need to monitor and maintain.
>
> Hope that helps!
>
> -Ewen
>
>
> >
> > Thanks for your guidance
> >
> > Regards,
> > Stephane
> >
>
>
