kafka-users mailing list archives

From Ewen Cheslack-Postava <e...@confluent.io>
Subject Re: One big kafka connect cluster or many small ones?
Date Fri, 06 Jan 2017 21:14:58 GMT
Yeah, you'd set the key.converter and/or value.converter in your connector
config.
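
A minimal sketch of what such a per-connector override could look like when building the request body for the Connect REST API (POST /connectors). The connector names, file paths, topics, and the Schema Registry URL are hypothetical placeholders; the Avro converter class assumes the Confluent converter is installed on the workers.

```python
import json

def connector_request(name, connector_class, converter_overrides, extra=None):
    """Build the JSON body for POST /connectors on a Connect worker."""
    config = {"connector.class": connector_class, "tasks.max": "1"}
    config.update(extra or {})
    # Per-connector converter settings (KIP-75). When present, they are used
    # instead of the key.converter / value.converter from the worker config.
    config.update(converter_overrides)
    return {"name": name, "config": config}

JSON_CONVERTER = "org.apache.kafka.connect.json.JsonConverter"
AVRO_CONVERTER = "io.confluent.connect.avro.AvroConverter"  # assumes Confluent's converter
REGISTRY_URL = "http://schema-registry:8081"  # hypothetical address

json_overrides = {
    "key.converter": JSON_CONVERTER,
    "value.converter": JSON_CONVERTER,
}
avro_overrides = {
    "key.converter": AVRO_CONVERTER,
    "key.converter.schema.registry.url": REGISTRY_URL,
    "value.converter": AVRO_CONVERTER,
    "value.converter.schema.registry.url": REGISTRY_URL,
}

FILE_SOURCE = "org.apache.kafka.connect.file.FileStreamSourceConnector"

# Two connectors on the same cluster, each with its own serialization format.
json_job = connector_request(
    "json-file-source", FILE_SOURCE, json_overrides,
    {"file": "/tmp/in-json.txt", "topic": "json-events"},  # hypothetical
)
avro_job = connector_request(
    "avro-file-source", FILE_SOURCE, avro_overrides,
    {"file": "/tmp/in-avro.txt", "topic": "avro-events"},  # hypothetical
)

# Each body would be sent to a worker, e.g. http://localhost:8083/connectors
print(json.dumps(json_job, indent=2))
print(json.dumps(avro_job, indent=2))
```

Both connectors can then live on one cluster; only the connector configs differ, not the worker config.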


On Thu, Jan 5, 2017 at 9:50 PM, Stephane Maarek <
stephane@simplemachines.com.au> wrote:

> Thanks!
> So I just override the conf while doing the API call? It’d be great to
> have this documented somewhere on the confluent website. I couldn’t find
> it.
> On 6 January 2017 at 3:42:45 pm, Ewen Cheslack-Postava (ewen@confluent.io)
> wrote:
> On Thu, Jan 5, 2017 at 7:19 PM, Stephane Maarek <
> stephane@simplemachines.com.au> wrote:
>> Thanks a lot for the guidance, I think we’ll go ahead with one cluster. I
>> just need to figure out how our CD pipeline can talk to our connect cluster
>> securely (because it’ll need direct access to perform API calls).
> The documentation isn't great here, but you can apply all the normal
> security configs to Connect (in distributed mode, it's basically equivalent
> to a consumer, so everything you can do with a consumer you can do with
> Connect).
>> Lastly, a question or maybe a piece of feedback… is it not possible to
>> specify the key serializer and deserializer as part of the rest api job
>> config?
>> The issue is that sometimes our data is avro, sometimes it’s json. And it
>> seems I’d need two separate clusters for that?
> This is new! As of, we have
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-75+-+Add+per-connector+Converters
> which allows you to include it in the connector config. It's called "Converter"
> in Connect because it does a bit more than ser/des if you've written them
> for Kafka, but they are basically just pluggable ser/des. We knew folks
> would want this, it just took us a while to find the bandwidth to implement
> it. Now, you shouldn't need to do anything special or deploy multiple
> clusters -- it's baked in and supported as long as you are willing to
> override it on a per-connector basis (and this seems reasonable for most
> folks since *ideally* you are *somewhat* standardized on a common
> serialization format).
> -Ewen
>> On 6 January 2017 at 1:54:10 pm, Ewen Cheslack-Postava (ewen@confluent.io)
>> wrote:
>> On Thu, Jan 5, 2017 at 3:12 PM, Stephane Maarek <
>> stephane@simplemachines.com.au> wrote:
>> > Hi,
>> >
>> > We like to operate in micro-services (dockerize and ship everything on ecs)
>> > and I was wondering which approach was preferred.
>> > We have one kafka cluster, one zookeeper cluster, etc, but when it comes to
>> > kafka connect I have some doubts.
>> >
>> > Is it better to have one big kafka connect with multiple nodes, or many
>> > small kafka connect clusters or standalone, for each connector / etl ?
>> >
>> You can do any of these, and it may depend on how you do
>> orchestration/deployment.
>> We built Connect to support running one big cluster running a bunch of
>> connectors. It balances work automatically and provides a way to control
>> scale up/down via increased parallelism. This means we don't need to make
>> any assumptions about how you deploy, how you handle elastically scaling
>> your clusters, etc. But if you run in an environment and have the tooling
>> in place to do that already, you can also opt to run many smaller clusters
>> and use that tooling to scale up/down. In that case you'd just make sure
>> there were enough tasks for each connector so that when you scale the # of
>> workers for a cluster up the rebalancing of work would ensure there were
>> enough tasks for every worker to remain occupied.
>> The main drawback of doing this is that Connect uses a few topics for
>> configs, status, and offsets and you need these to be unique per cluster.
>> This means you'll have 3N more topics. If you're running a *lot* of
>> connectors, that could eventually become a problem. It also means you have
>> that many more worker configs to handle, clusters to monitor, etc. And
>> deploying a connector is no longer as simple as just making a call to
>> the service's REST API since there isn't a single centralized service. The
>> main benefits I can think of are a) if you already have preferred tooling
>> for handling elasticity and b) better resource isolation between connectors
>> (i.e. an OOM error in one connector won't affect any other connectors).
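
To make the "3N more topics" point concrete, here is a small sketch that generates the worker settings that would have to differ for each Connect cluster sharing one Kafka cluster. The cluster names and the naming scheme are hypothetical; `group.id` is included as an assumption beyond the text above, since each distributed Connect cluster also needs its own group.

```python
def worker_overrides(cluster_name):
    """Worker settings that must be unique per Connect cluster."""
    return {
        "group.id": f"connect-{cluster_name}",
        "config.storage.topic": f"connect-{cluster_name}-configs",
        "offset.storage.topic": f"connect-{cluster_name}-offsets",
        "status.storage.topic": f"connect-{cluster_name}-status",
    }

clusters = ["jdbc-etl", "s3-sink", "log-ingest"]  # hypothetical cluster names
configs = {name: worker_overrides(name) for name in clusters}

# Collect every internal topic across all clusters.
internal_topics = {
    value
    for cfg in configs.values()
    for key, value in cfg.items()
    if key.endswith(".storage.topic")
}

# N clusters need 3N distinct internal topics.
assert len(internal_topics) == 3 * len(clusters)

for name, cfg in configs.items():
    print(f"# worker properties for cluster {name}")
    for key, value in cfg.items():
        print(f"{key}={value}")
```

With a single shared cluster, only one such set of settings (and three internal topics) is needed regardless of how many connectors run on it.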
>> For standalone mode, we'd generally recommend only using it when
>> distributed mode doesn't make sense, e.g. for log file collection. Other
>> than that, having the fault tolerance and high availability of distributed
>> mode is preferred.
>> On your specific points:
>> >
>> > The issues I’m trying to address are :
>> > - Integration with our CI/CD pipeline
>> >
>> I'm not sure anything about Connect affects this. Is there a specific
>> concern you have about the CI/CD pipeline & Connect?
>> > - Efficient resources utilisation
>> >
>> Putting all the connectors into one cluster will probably result in better
>> resource utilization unless you're already automatically tracking usage and
>> scaling appropriately. The reason is that if you use a bunch of small
>> clusters, you're now stuck trying to optimize N uses. Since Connect can
>> already (roughly) balance work, putting all the work into one cluster and
>> having Connect split it up means you just need to watch utilization of the
>> nodes in that one cluster and scale up or down as appropriate.
>> > - Easily add new jar files that connectors depend on with minimal downtime
>> >
>> This one is a bit interesting. You shouldn't have any downtime adding jars
>> in the sense that you can do rolling bounces of Connect. The one caveat is
>> that the current limitation for how it rebalances work involves halting
>> work for all connectors/tasks, doing the rebalance, and then starting them
>> up again. We plan to improve this, but the timeframe for it is still
>> uncertain. Usually these rebalance steps should be pretty quick. The main
>> reason this can be a concern is that halting some connectors could take
>> some time (e.g. because they need to fully flush their data). This means
>> the period of time your connectors are not processing data during one of
>> those rebalances is controlled by the "worst" connector.
>> I would recommend trying a single cluster but monitoring whether you see
>> stalls due to rebalances. If you do, then moving to multiple clusters might
>> make sense. (This also, obviously, depends a lot on your SLA for data
>> delivery.)
>> > - Monitoring operations
>> >
>> Multiple clusters definitely seem messier and more complicated for this.
>> There will be more workers in a single cluster, but it's a single service
>> you need to monitor and maintain.
>> Hope that helps!
>> -Ewen
>> >
>> > Thanks for your guidance
>> >
>> > Regards,
>> > Stephane
>> >
