kafka-users mailing list archives

From Neha Narkhede <neha.narkh...@gmail.com>
Subject Re: Number of feeds, how does it scale?
Date Mon, 09 Apr 2012 19:53:01 GMT
Taylor,

Thanks for sharing the details of your Kafka deployment! I'm wondering
if you could share why you needed to turn off ZooKeeper registration
on the Kafka brokers. Currently, the Kafka broker creates a node in
ZooKeeper when:

1. it starts up
2. it receives a message for a new topic
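
For reference, both registrations land in a flat namespace in
ZooKeeper. Here is a minimal sketch of inspecting them with the stock
ZooKeeper Java client (znode paths roughly as in the 0.7-era layout;
the connect string is just a placeholder):

    import java.util.List;
    import org.apache.zookeeper.ZooKeeper;

    public class BrokerRegistry {
        public static void main(String[] args) throws Exception {
            // Connect to the ensemble the brokers register with.
            ZooKeeper zk = new ZooKeeper("localhost:2181", 30000, event -> { });
            // One ephemeral child per live broker (created at startup).
            List<String> brokers = zk.getChildren("/brokers/ids", false);
            // One child per topic ever produced to (a flat list).
            List<String> topics = zk.getChildren("/brokers/topics", false);
            System.out.println(brokers.size() + " brokers, "
                    + topics.size() + " topics");
            zk.close();
        }
    }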

> Note that we will continue with our current sharded solution, because
> adding ZooKeeper back in would likely cause a lot of bottlenecks, not just
> with the startup (all topics are stored in a flat hierarchy in ZooKeeper
> too, so it's likely this might not do well at high scale).

ZooKeeper can handle tens of thousands of writes/sec. So are you
saying that topics get created at such a high rate that you get close
to the maximum write throughput that ZooKeeper can provide?

Thanks,
Neha

On Mon, Apr 9, 2012 at 12:11 PM, Taylor Gautier <tgautier@tagged.com> wrote:
> We've worked on this problem at Tagged.  In the current implementation, a
> Kafka cluster can handle around 20k simultaneous topics.  This is
> using a RHEL 5 machine with an ext3-backed filesystem, and the bottleneck is
> mostly the filesystem itself (due to the fact that Kafka stores all
> topics in a single directory).
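>
> For context, that single directory holds one subdirectory per
> topic-partition, so at 20k topics ext3 is managing roughly 20k sibling
> entries. The layout looks roughly like this (names illustrative):
>
>   /var/kafka-logs/
>     feed-00001-0/00000000000000000000.kafka
>     feed-00002-0/00000000000000000000.kafka
>     ...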
>
> We have implemented a fast cleaner that deletes topics that are no longer
> used.  This cleaner really wipes the topic, instead of leaving a
> file/directory around as the normal cleaner does.  Since this resets the
> topic offset back to 0, you have to make sure your clients can handle it
> (in theory, clients already need to handle this case, since Kafka would
> behave the same way if the 64-bit topic offset ever wrapped around,
> though in practice that probably never happens).
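>
> A minimal sketch of that client-side handling, against the 0.7-era
> SimpleConsumer API (method names to the best of my memory; treat this
> as a sketch, not our production code): on an out-of-range fetch,
> re-seek to the earliest offset the broker still has.
>
>   import kafka.javaapi.consumer.SimpleConsumer;
>
>   // If a saved offset is invalid because the topic was wiped,
>   // restart from the earliest available offset instead of failing.
>   SimpleConsumer consumer =
>       new SimpleConsumer("broker-host", 9092, 30000, 64 * 1024);
>   long[] valid = consumer.getOffsetsBefore("some-topic", 0,
>       kafka.api.OffsetRequest.EarliestTime(), 1);
>   long nextOffset = valid[0];  // back to 0 after the fast cleaner runs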
>
> The other thing of note is that this number of topics with ZooKeeper turned
> on makes startup intolerably slow, as Kafka goes through a validation
> process.  Therefore we have also turned ZooKeeper off.
>
> To allow us to handle a high number of topics in this configuration, we
> have written a custom sharding/routing on top of Kafka.  The layers look
> like this:
>
> +--------------------------------------------------------------------------+
> | Pub/sub (implements custom sharding and shields from specifics of Kafka) |
> +--------------------------------------------------------------------------+
> | Kafka                                                                     |
> +--------------------------------------------------------------------------+
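>
> To make the routing concrete, a hypothetical sketch of the idea (not
> our actual code): hash the topic name to pick which Kafka cluster owns
> it, so no single cluster has to hold every topic.
>
>   // Route each topic to one of N independent Kafka clusters, keeping
>   // every cluster under the ~20k simultaneous-topic ceiling.
>   static String clusterFor(String topic, java.util.List<String> clusters) {
>       return clusters.get(Math.floorMod(topic.hashCode(), clusters.size()));
>   }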
>
> Currently we are handling on the order of 100k topics concurrently.  We
> plan to 10x that in the near future.  To do so we are planning on
> implementing the following:
>
>   - SSD-backed disks (vs. the current eSATA-backed machines)
>   - a topic hierarchy (to alleviate the 20k bottleneck)
>
> Note that we will continue with our current sharded solution, because
> adding ZooKeeper back in would likely cause a lot of bottlenecks, not just
> with the startup (all topics are stored in a flat hierarchy in ZooKeeper
> too, so it's likely this might not do well at high scale).
>
> Hope that helps!
>
> On Mon, Apr 9, 2012 at 11:04 AM, Eric Tschetter <echeddar@gmail.com> wrote:
>
>> Hi guys,
>>
>> I'm wondering about experiences with a large number of feeds created
>> and managed on a single Kafka cluster.  Specifically, if anyone can
>> share information about how many different feeds they have on their
>> Kafka cluster and their overall throughput, that'd be cool.
>>
>> Some background: I'm planning on setting up a system around Kafka that
>> will (hopefully, eventually) have >10,000 feeds in parallel.  I expect
>> event volume on these feeds to follow a zipfian distribution, so
>> there will be a long tail of smaller feeds and some large ones, but
>> there will be consumers for each of these feeds.  I'm trying to decide
>> between relying on Kafka's feeds to maintain the separation between
>> the data streams, or creating one large aggregate feed and using
>> Kafka's partitioning mechanisms along with some custom logic to keep
>> the feeds separated.  I'd prefer to use Kafka's built-in feed
>> mechanisms, because there are significant benefits to that, but I can
>> also imagine a world where that many feeds were not in the base
>> assumptions of how the system would be used, making performance
>> questionable.
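>>
>> To make the aggregate-feed option concrete, here is a hypothetical
>> sketch of what I have in mind: tag each message with its feed id so
>> a consumer can split the streams back out.
>>
>>   import java.nio.charset.StandardCharsets;
>>
>>   // Hypothetical envelope: prefix each payload with "<feedId>\t" so
>>   // consumers of the single aggregate feed can demultiplex it
>>   // (assumes feed ids never contain a tab).
>>   final class FeedEnvelope {
>>       static byte[] wrap(String feedId, String payload) {
>>           return (feedId + "\t" + payload).getBytes(StandardCharsets.UTF_8);
>>       }
>>       static String feedOf(byte[] message) {
>>           String s = new String(message, StandardCharsets.UTF_8);
>>           return s.substring(0, s.indexOf('\t'));
>>       }
>>   }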
>>
>> Any input is appreciated.
>>
>> --Eric
>>
