kafka-users mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Taylor Gautier <tgaut...@tagged.com>
Subject Re: Number of feeds, how does it scale?
Date Mon, 09 Apr 2012 19:11:52 GMT
We've worked on this problem at Tagged.  In the current implementation a
current Kafka cluster can handle around 20k simultaneous topics.  This is
using a RHEL 5 machine with EXT3 backed filesystem, and the bottleneck is
mostly in the filesystem itself (due to the fact that Kafka stores all
topics in a single directory).

We have implemented a fast cleaner that deletes topics that are no longer
used.  This cleaner really wipes the topic, instead of leaving a
file/directory around as the normal cleaner does.  Since this resets the
topic offset back to 0, you have to make sure your clients can handle this
(however, in theory, your clients already need to handle this situation
since it is a normal part of Kafka should the topic offset wrap around the
64-bit value, though in practice it probably never happens).

The other thing of note is that this number of topics with zookeeper turned
on makes startup intolerably slow as Kafka goes through a validation
process.  Therefore we also have turned off zookeeper.

To allow us to handle a high number of topics in this configuration, we
have written a custom sharding/routing on top of Kafka.  The layers look
like this:

+----------------------------------------------------------------------------+
| Pub/sub (implements custom sharding and shields from specifics of Kafka)
  |
+----------------------------------------------------------------------------+
| Kafka
 |
+----------------------------------------------------------------------------+

Currently we are handling concurrently on the order of 100k topics.  We
plan to 10x that in the near future.  To do so we are planning on
implementing the following:

   - SSD backed disks (vs. the current eSATA backed machines)
   - a topic hierarchy (to alleviate the 20k bottleneck

Note that we will continue with our current sharded solution, because
adding Zookeeper back in would likely cause a lot of bottlenecks, not just
with the startup (all topics are stored in a flat hiearchy in Zookeeper
too, so it's likely this might not do well at high scale).

Hope that helps!

On Mon, Apr 9, 2012 at 11:04 AM, Eric Tschetter <echeddar@gmail.com> wrote:

> Hi guys,
>
> I'm wondering about experiences with a large number of feeds created
> and managed on a single Kafka cluster.  Specifically, if anyone can
> share information about how many different feeds they have on their
> kafka cluster and overall throughput, that'd be cool.
>
> Some background: I'm planning on setting up a system around Kafka that
> will (hopefully, eventually) have >10,000 feeds in parallel.  I expect
> event volume on these feeds to follow a zipfian distribution.  So,
> there will be a long-tail of smaller feeds and some large ones, but
> there will be consumers for each of these feeds.  I'm trying to decide
> between relying on Kafka's feeds to maintain the separation between
> the data streams, or if I should actually create one large aggregate
> feed and utilize Kafka's partitioning mechanisms along with some
> custom logic to keep the feeds separated.  I prefer to use Kafka's
> built-in feed mechanisms, cause there are significant benefits to
> that, but I can also imagine a world where that many feeds was not in
> the base assumptions of how the system would be used and thus
> questionable around performance.
>
> Any input is appreciated.
>
> --Eric
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message