kafka-users mailing list archives

From Andrey Falko <afalko@salesforce.com>
Subject Re: Kafka Replicated Partition Limits
Date Wed, 03 Jan 2018 23:00:50 GMT
Ben Wood:
1. We have 5 ZK nodes.
2. So far I have only tracked outstanding requests on the ZK side. At
9.5k topics, I recorded about 5k outstanding requests. I'll track this
more thoroughly for my next run; the probe sketch below shows what I
have in mind. Anything else worth tracking?
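
Here is the kind of probe I have in mind: a minimal sketch that polls
one ZK server with the "mntr" four-letter command and filters the
fields relevant to this test. The host, port, and field selection are
placeholders, not finished tooling.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

// Sends "mntr" to a single ZooKeeper server and prints the metrics we
// track in this test. Requires ZK 3.4+ with four-letter words enabled.
public class ZkMntrProbe {
    public static void main(String[] args) throws Exception {
        try (Socket s = new Socket("zk1.example.com", 2181)) {  // placeholder host
            OutputStream out = s.getOutputStream();
            out.write("mntr".getBytes(StandardCharsets.US_ASCII));
            out.flush();
            s.shutdownOutput();  // signal end of request
            BufferedReader in = new BufferedReader(
                new InputStreamReader(s.getInputStream(), StandardCharsets.US_ASCII));
            String line;
            while ((line = in.readLine()) != null) {
                if (line.startsWith("zk_outstanding_requests")
                        || line.startsWith("zk_avg_latency")
                        || line.startsWith("zk_watch_count")
                        || line.startsWith("zk_znode_count")) {
                    System.out.println(line);
                }
            }
        }
    }
}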

Jun Rao:
I'm testing the latest 1.0.0, in normal mode; I don't take anything
down. I had a run where I tried scaling the brokers up, but it didn't
improve things. Thanks for pointing me to the KIP!

On Wed, Jan 3, 2018 at 2:50 PM, Jun Rao <jun@confluent.io> wrote:
> Hi, Andrey,
>
> Thanks for reporting the results. Which version of Kafka are you testing?
> Also, it would be useful to know if you are testing the normal mode when
> all replicas are up and in sync, or the failure mode when some of the
> replicas are being restarted. Typically, ZK is only accessed in the failure
> mode.
>
> We have made some significant improvements in the failure mode by reducing
> the logging overhead (KAFKA-6116) and making the ZK accesses async
> (KAFKA-5642). These won't necessarily reduce the number of requests to ZK,
> but will allow better pipelining when accessing ZK (a sketch of that async
> pattern follows the quoted thread below).
>
> In the normal mode, we are now discussing KIP-227, which could reduce the
> overhead for replication and consumption when there are many partitions.
>
> Jun
>
> On Wed, Jan 3, 2018 at 1:48 PM, Andrey Falko <afalko@salesforce.com> wrote:
>
>> Hi everyone,
>>
>> We are seeing more and more push from our Kafka users to support well
>> over 10k replicated partitions. We'd ideally like to avoid running
>> multiple clusters, to keep our cluster management and monitoring simple,
>> so we started testing Kafka to see how many replicated partitions it
>> could handle.
>>
>> We found that, to maintain our SLA of under 50ms produce latency,
>> Kafka starts going downhill at around 9k topics on 5 brokers; each
>> topic is replicated 3x in our test (the latency probe we use is
>> sketched after the quoted thread below). The bottleneck appears to be
>> ZooKeeper: after a certain period of time, the number of outstanding
>> requests in ZK starts climbing at a linear rate. Slowing down the rate
>> at which we create and produce to topics improves things, but doing
>> that makes the system tougher to manage and use. We are happy to
>> publish our detailed results with reproduction steps if anyone is
>> interested.
>>
>> Has anyone overcome this problem and scaled beyond 9k replicated
>> partitions? Does anyone have ZooKeeper tuning suggestions (the knobs
>> we plan to try are sketched below)? Is ZK even the bottleneck?
>>
>> According to this post, we should have at most 300 3x-replicated
>> partitions per broker:
>> https://www.confluent.io/blog/how-to-choose-the-number-of-topics-partitions-in-a-kafka-cluster/
>> Is anyone doing work to have Kafka support more than that?
>>
>> Best regards,
>> Andrey Falko
>> Salesforce.com
>>
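
For reference, a minimal sketch of the async/pipelined ZK access
pattern Jun describes for KAFKA-5642, using the stock
org.apache.zookeeper client. The connect string and znode paths are
placeholders, and this illustrates the pattern only; it is not Kafka's
actual controller code.

import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.AsyncCallback.DataCallback;
import org.apache.zookeeper.ZooKeeper;

// Instead of one blocking getData() per znode, fire all reads at once
// and let the ZK client pipeline them over a single connection; replies
// arrive on the client's event thread.
public class AsyncZkReads {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("zk1.example.com:2181", 30000, event -> { });
        String[] paths = { "/brokers/topics/t0", "/brokers/topics/t1" };  // placeholders
        CountDownLatch done = new CountDownLatch(paths.length);
        DataCallback cb = (rc, path, ctx, data, stat) -> {
            System.out.println(path + " rc=" + rc
                + " bytes=" + (data == null ? 0 : data.length));
            done.countDown();
        };
        for (String p : paths) {
            zk.getData(p, false, cb, null);  // returns immediately
        }
        done.await();  // wait for all pipelined replies
        zk.close();
    }
}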
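
And the latency probe mentioned above, as a minimal sketch: we define
produce latency as the time from send() to the broker ack, measured in
the producer callback. Broker address, topic name, and message count
are placeholders; the real harness is more involved.

import java.util.Properties;
import java.util.concurrent.atomic.AtomicLong;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

// Times each send() until the broker acks it and flags 50ms SLA misses.
public class ProduceLatencyProbe {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092");  // placeholder
        props.put("acks", "all");  // wait for the full ISR, matching 3x replication
        props.put("key.serializer",
            "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
            "org.apache.kafka.common.serialization.StringSerializer");
        AtomicLong worstMs = new AtomicLong();
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            for (int i = 0; i < 10000; i++) {
                long start = System.nanoTime();
                producer.send(new ProducerRecord<>("test-topic", "key", "value"),
                    (metadata, exception) -> {
                        long ms = (System.nanoTime() - start) / 1_000_000;
                        worstMs.accumulateAndGet(ms, Math::max);
                        if (ms > 50) System.out.println("SLA miss: " + ms + " ms");
                    });
            }
        }  // close() flushes outstanding sends and fires remaining callbacks
        System.out.println("worst observed latency: " + worstMs.get() + " ms");
    }
}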
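
Finally, the ZK tuning knobs we plan to experiment with, as a zoo.cfg
sketch. These values are guesses to test against our workload, not
recommendations; defaults are noted in the comments.

# zoo.cfg (excerpt)
dataDir=/var/lib/zookeeper
# Put the transaction log on its own disk so fsyncs don't compete with snapshots.
dataLogDir=/var/lib/zookeeper-txlog
# Snapshot less often under heavy write load (default 100000).
snapCount=500000
# Allow more queued requests before the server throttles clients (default 1000).
globalOutstandingLimit=5000
# Purge old snapshots and txn logs automatically.
autopurge.snapRetainCount=5
autopurge.purgeInterval=12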
