kafka-users mailing list archives

From Shaun Senecal <shaun.sene...@lithium.com>
Subject Re: number of topics given many consumers and groups within the data
Date Wed, 30 Sep 2015 16:24:53 GMT
Thanks for the link.  I have come across that at some point in the past, but I don't think
it quite addresses the issue I'm looking at.

I think the custom partitioner strategy doesn't work either, though.  The number of groups
we have changes over time, so we can't have a fixed strategy.  We could use hashing and just
create a large number of partitions so that "most of the time" there is only 1 group per partition;
however, as far as I can tell, this is exactly the same as having 1 topic per group (but with
more complexity).  Am I wrong?  I am under the impression that having 1000 topics with 1 partition
each incurs the same load/cost on the Kafka brokers as 1 topic with 1000 partitions.
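
To make that "most of the time" concrete, here is a rough, self-contained sketch (all numbers
and names are made up for illustration; it models a well-mixed hash as a uniform random pick)
of how often 400 groups would land on distinct partitions:

    import java.util.HashSet;
    import java.util.Random;
    import java.util.Set;

    public class HashCollisionSketch {
        public static void main(String[] args) {
            // Hypothetical numbers: 400 groups hashed into 100,000 partitions
            int groups = 400, partitions = 100_000, trials = 1_000;
            Random rnd = new Random(42);
            int collisionFree = 0;
            for (int t = 0; t < trials; t++) {
                Set<Integer> used = new HashSet<>();
                boolean clean = true;
                for (int g = 0; g < groups; g++) {
                    // Model a well-mixed hash as a uniform random partition pick
                    if (!used.add(rnd.nextInt(partitions))) { clean = false; break; }
                }
                if (clean) collisionFree++;
            }
            // Birthday problem: collision-free rate ~ exp(-G*(G-1)/(2P)), about 45% here,
            // so even 100k partitions leave two groups sharing a partition quite often
            System.out.printf("collision-free trials: %d / %d%n", collisionFree, trials);
        }
    }

So unless partitions vastly outnumber groups, some partitions will carry more than one group.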


From: Ben Stopford <ben@confluent.io>
Sent: September 30, 2015 9:06 AM
To: users@kafka.apache.org
Subject: Re: number of topics given many consumers and groups within the data

Hi Shaun

You might consider using a custom partition assignment strategy to push your different "groups"
to different partitions. This would allow you to walk the middle ground between "all consumers
consume everything" and "one topic per consumer" as you vary the number of partitions
in the topic, albeit at the cost of a little extra complexity.
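
As a rough sketch of the producer side (the class name and hash choice are just placeholders;
this assumes the standard kafka-clients Partitioner interface), something like:

    import java.util.Map;
    import org.apache.kafka.clients.producer.Partitioner;
    import org.apache.kafka.common.Cluster;
    import org.apache.kafka.common.utils.Utils;

    // Illustrative only: routes each record to a partition derived from its
    // group key, so each partition serves a stable subset of groups
    public class GroupPartitioner implements Partitioner {
        @Override
        public int partition(String topic, Object key, byte[] keyBytes,
                             Object value, byte[] valueBytes, Cluster cluster) {
            int numPartitions = cluster.partitionsForTopic(topic).size();
            if (keyBytes == null) return 0; // keyless records: arbitrary fallback
            // Same murmur2 hash the default partitioner applies to keyed records
            return Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
        }

        @Override public void close() {}
        @Override public void configure(Map<String, ?> configs) {}
    }

You would register it on the producer via the partitioner.class config.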

Also, not sure if you’ve seen it but there is quite a good section in the FAQ here <https://cwiki.apache.org/confluence/display/KAFKA/FAQ#FAQ-HowmanytopicscanIhave?>
on topic and partition sizing.


> On 29 Sep 2015, at 18:48, Shaun Senecal <shaun.senecal@lithium.com> wrote:
> Hi
> I have read Jay Kreps' post regarding the number of topics that can be handled by a broker
(https://www.quora.com/How-many-topics-can-be-created-in-Apache-Kafka), and it has left me
with more questions that I don't see answered anywhere else.
> We have a data stream which will be consumed by many consumers (~400).  We also have
many "groups" within our data.  A group in the data corresponds 1:1 with what the consumers
would consume, so consumer A only ever sees group A messages, consumer B only consumes group
B messages, etc.
> The downstream consumers will be consuming via a websocket API, so the API server will
be the thing consuming from kafka.
> If I use a single topic with, say, 20 partitions, each consumer in the API server would
need to re-read the same messages as every other consumer, which seems like a waste
of network and a potential bottleneck.
> Alternatively, I could use a single topic with 20 partitions and have a single consumer
in the API put the messages into cassandra/redis (as suggested by Jay), and serve out the
downstream consumer streams that way.  However, that requires a secondary sorted store,
which seems like a waste (and added complexity) given that Kafka already has the data exactly
as I need it, especially if cassandra/redis are required to maintain a long TTL on the stream.
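> A rough sketch of that variant (Jedis client and topic/key names are all made up; one
consumer fans the stream out into per-group Redis lists that the API server then reads):

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import redis.clients.jedis.Jedis;

    // Illustrative only: a single consumer reads the topic once and appends
    // each record to a per-group Redis list; TTL/trimming is left out
    public class RedisFanOutSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // hypothetical
            props.put("group.id", "redis-fanout");
            props.put("key.deserializer",
                      "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer",
                      "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
                 Jedis redis = new Jedis("localhost", 6379)) {
                consumer.subscribe(Collections.singletonList("events")); // hypothetical topic
                while (true) {
                    for (ConsumerRecord<String, String> rec :
                             consumer.poll(Duration.ofMillis(500))) {
                        // One Kafka read total; downstream clients read
                        // "stream:<group>" from Redis instead of Kafka
                        redis.rpush("stream:" + rec.key(), rec.value());
                    }
                }
            }
        }
    }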
> Finally, I could use 1 topic per group, each with a single partition.  This would result
in 400 topics on the broker, but would allow the API server to simply serve the stream for
each consumer directly from kafka, and wouldn't require additional machinery to serve out the requests.
> The 400 topic solution makes the most sense to me (doesn't require extra services, doesn't
waste resources), but it seems to conflict with best practices, so I wanted to ask the community
for input.  Has anyone done this before?  What makes the most sense here?
> Thanks
> Shaun
