kafka-users mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From David Espinosa <espi...@gmail.com>
Subject Re: Recommended max number of topics (and data separation)
Date Sun, 28 Jan 2018 16:45:26 GMT
Hi Monty,

I'm also planning to use a big amount of topics in Kafka, so recently I
made a test within a 3 nodes kafka cluster where I created 100k topics with
one partition. Sent 1M messages in total.
These are my conclusions:

   - There is not any limitation on kafka regarding the number of topics
   but on Zookeeper and in the system where Kafka nodes is allocated.
   - Zookeeper will start having problems from 70k topics, which can be
   solved modifying a buffer parameter on the JVM (-Djute.maxbuffer).
   Performance is reduced.
   - Open file descriptors of the system are equivalent to [number of
   topics]X[number of partitions per topic]. Set to 128k in my test to avoid
   problems.
   - System needs a big amount of memory for page caching.

So, after creating 100k with the required setup (system+JVM) but seeing
problems at 70k, I feel safe by not creating more than 50k, and always will
have Zookeeper as my first suspect if a problem comes. I think with proper
resources (memory) and system setup (open file descriptors), you don't have
any real limitation regarding partitions.
By the way, I used long topic names (about 30 characters), which can be
important for ZK.
Hope this information is of your help.

David

2018-01-28 2:22 GMT+01:00 Monty Hindman <montyhindman@gmail.com>:

> I'm designing a system and need some more clarity regarding Kafka's
> recommended limits on the number of topics and/or partitions. At a high
> level, our system would work like this:
>
> - A user creates a job X (X is a UUID).
> - The user uploads data for X to an input topic: X.in.
> - Workers process the data, writing results to an output topic: X.out.
> - The user downloads the data from X.out.
>
> It's important for the system that data for different jobs be kept
> separate, and that input and output data be kept separate. By "separate" I
> mean that there needs to be a reasonable way for users and the system's
> workers to query for the data they need (by job-id and by input-vs-output)
> and not get the data they don't need.
>
> Based on expected usage and our data retention policy, we would not expect
> to need more than 12,000 active jobs at any one time -- in other words,
> 24,000 topics. If we were to have 5 partitions per topic (our cluster has 5
> brokers), that would imply 120,000 partitions. [These number refer only to
> main/primary partitions, not any replicas that might exist.]
>
> Those numbers seem to be far larger than the suggested limits I see online.
> For example, the Kafka FAQ on these matters seems to imply that the most
> relevant limit is the number of partitions (rather than topics) and sort of
> implies that 10,000 partitions might be a suggested guideline (
> https://goo.gl/fQs2md). Also implied is that systems should use fewer
> topics and instead partition the data within topics if further separation
> is needed (the FAQ entry uses the example of partitioning by user ID, which
> is roughly analogous to job ID in my use case).
>
> The guidance in the FAQ is unclear to me:
>
> - Does the suggested limit of 10,000 refer to the total number of
> partitions (ie, main partitions plus any replicas) or just the main
> partitions?
>
> - If the most important limitation is number of partitions (rather than
> number of topics), how does the suggested strategy of using fewer topics
> and then partitioning by some other attribute (ie job ID) help at all?
>
> - Is my use case just a bad fit for Kafka? Or, is there a way for us to use
> Kafka while still supporting the kinds of query patterns that we need (ie,
> by job ID and by input-vs-output)?
>
> Thanks in advance for any guidance.
>
> Monty
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message