kafka-users mailing list archives

From Taylor Gautier <tgaut...@tagged.com>
Subject Re: Number of feeds, how does it scale?
Date Mon, 09 Apr 2012 20:42:54 GMT
During startup, if I recall correctly, there is a validation sequence when
zookeeper is enabled.  For each topic on disk, Kafka checks with zookeeper
that the topic is still valid.  From my recollection this check takes a
small amount of time per topic and is done serially, so with 5-10k topics
startup would take several minutes, versus several seconds with zookeeper
disabled.
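The cost of that serial check can be sketched with a back-of-envelope calculation. The 5 ms per-topic round-trip figure below is an assumption for illustration only, not a number from this thread:

```python
# Back-of-envelope sketch of why a serial per-topic ZooKeeper check
# dominates startup.  The 5 ms round-trip is an assumed figure.

ZK_ROUND_TRIP_SECONDS = 0.005  # assumed per-topic validation latency

def startup_validation_time(num_topics, rtt=ZK_ROUND_TRIP_SECONDS):
    # Serial validation: total time grows linearly with topic count.
    return num_topics * rtt

print(startup_validation_time(10_000))  # 50.0 seconds at 5 ms/topic
```

At that assumed latency, 10k topics cost about 50 seconds of startup just for validation, which matches the "minutes vs. seconds" difference described above.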

Before we implemented the runtime cleaner we had to shut down kafka,
manually wipe the topics that were not in use, and then start it back up,
so start up time was critical.

At this point we could turn zookeeper back on, but since each kafka
instance runs in isolation from the others, there isn't any practical
reason it's needed; in fact, it just complicates the startup procedure for
each machine, as we have to start zookeeper first and then kafka.

In all honesty, we were in a mad scramble to just release our
implementation and make it work, so we didn't work hard to make things
work in an ideal sense.  Since fixing the last issues sometime in early
January it has been running with practically zero effort on our part, so
it's been hard to find the incentive to go in and fix/change anything (or
reconsider any decisions we made) - aka don't fix something that ain't
broke.

We have several offshoot kafka use cases coming down the pike, so this
decision as well as others may get some time for reconsideration.

On Mon, Apr 9, 2012 at 12:53 PM, Neha Narkhede <neha.narkhede@gmail.com> wrote:

> Taylor,
> Thanks for sharing the details of your Kafka deployment! I'm wondering
> if you could share why you needed to turn off zookeeper registration
> on the Kafka brokers. Currently, the Kafka broker creates a node in ZK
> when
> 1. it starts up
> 2. it receives a message for a new topic
> >> Note that we will continue with our current sharded solution, because
> >> adding Zookeeper back in would likely cause a lot of bottlenecks, not
> >> just with the startup (all topics are stored in a flat hierarchy in
> >> Zookeeper too, so it's likely this might not do well at high scale)
> Zookeeper can handle tens of thousands of writes/sec. So are you
> saying that topics get created at such a high rate that you get close
> to the maximum write throughput that zookeeper can provide?
> Thanks,
> Neha
> On Mon, Apr 9, 2012 at 12:11 PM, Taylor Gautier <tgautier@tagged.com>
> wrote:
> > We've worked on this problem at Tagged.  In the current implementation, a
> > Kafka cluster can handle around 20k simultaneous topics.  This is using a
> > RHEL 5 machine with an EXT3-backed filesystem, and the bottleneck is
> > mostly the filesystem itself (due to the fact that Kafka stores all
> > topics in a single directory).
> >
> > We have implemented a fast cleaner that deletes topics that are no longer
> > used.  This cleaner really wipes the topic, instead of leaving a
> > file/directory around as the normal cleaner does.  Since this resets the
> > topic offset back to 0, you have to make sure your clients can handle this
> > (however, in theory, your clients already need to handle this situation
> > since it is a normal part of Kafka should the topic offset wrap around the
> > 64-bit value, though in practice it probably never happens).
> >
> > The other thing of note is that this number of topics with zookeeper
> > turned on makes startup intolerably slow, as Kafka goes through a
> > validation process.  Therefore we have also turned off zookeeper.
> >
> > To allow us to handle a high number of topics in this configuration, we
> > have written a custom sharding/routing layer on top of Kafka.  The layers
> > look like this:
> >
> >
> > +--------------------------------------------------------------------------+
> > | Pub/sub (implements custom sharding and shields from specifics of Kafka) |
> > +--------------------------------------------------------------------------+
> > | Kafka                                                                    |
> > +--------------------------------------------------------------------------+
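The routing layer in the diagram above could work roughly as sketched below: each topic name is hashed deterministically onto one standalone Kafka instance. The broker list and hash choice are assumptions for illustration; Tagged's actual routing code is not shown in this thread:

```python
import hashlib

# Illustrative topic -> broker routing for isolated Kafka instances.
# All producers and consumers using the same function agree on the
# shard without any central coordination (e.g. without ZooKeeper).

def shard_for_topic(topic, brokers):
    # md5 gives a stable hash across processes (unlike Python's hash()).
    digest = int(hashlib.md5(topic.encode("utf-8")).hexdigest(), 16)
    return brokers[digest % len(brokers)]

brokers = ["kafka-01:9092", "kafka-02:9092", "kafka-03:9092"]
print(shard_for_topic("user.12345.feed", brokers))
```

A static hash like this assumes the broker list rarely changes; growing the cluster would remap topics, which is one reason a real routing layer might prefer consistent hashing.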
> >
> > Currently we are handling concurrently on the order of 100k topics.  We
> > plan to 10x that in the near future.  To do so we are planning on
> > implementing the following:
> >
> >   - SSD backed disks (vs. the current eSATA backed machines)
> >   - a topic hierarchy (to alleviate the 20k bottleneck)
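One plausible reading of "a topic hierarchy" is spreading topic directories across subdirectories so no single EXT3 directory holds tens of thousands of entries. The layout below is an assumption for illustration, not Kafka's actual on-disk scheme:

```python
import hashlib

# Hypothetical on-disk topic hierarchy: hash each topic name into one
# of `fanout` buckets so directory sizes stay small even at 100k+ topics.

def topic_path(topic, fanout=256):
    bucket = int(hashlib.sha1(topic.encode("utf-8")).hexdigest(), 16) % fanout
    return "%02x/%s" % (bucket, topic)

print(topic_path("user.12345.feed"))
```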
> >
> > Note that we will continue with our current sharded solution, because
> > adding Zookeeper back in would likely cause a lot of bottlenecks, not
> > just with the startup (all topics are stored in a flat hierarchy in
> > Zookeeper too, so it's likely this might not do well at high scale).
> >
> > Hope that helps!
> >
> > On Mon, Apr 9, 2012 at 11:04 AM, Eric Tschetter <echeddar@gmail.com>
> wrote:
> >
> >> Hi guys,
> >>
> >> I'm wondering about experiences with a large number of feeds created
> >> and managed on a single Kafka cluster.  Specifically, if anyone can
> >> share information about how many different feeds they have on their
> >> kafka cluster and overall throughput, that'd be cool.
> >>
> >> Some background: I'm planning on setting up a system around Kafka that
> >> will (hopefully, eventually) have >10,000 feeds in parallel.  I expect
> >> event volume on these feeds to follow a zipfian distribution.  So,
> >> there will be a long-tail of smaller feeds and some large ones, but
> >> there will be consumers for each of these feeds.  I'm trying to decide
> >> between relying on Kafka's feeds to maintain the separation between
> >> the data streams, or if I should actually create one large aggregate
> >> feed and utilize Kafka's partitioning mechanisms along with some
> >> custom logic to keep the feeds separated.  I prefer to use Kafka's
> >> built-in feed mechanisms, because there are significant benefits to
> >> that, but I can also imagine a world where that many feeds was not in
> >> the base assumptions of how the system would be used, and performance
> >> is thus questionable.
> >>
> >> Any input is appreciated.
> >>
> >> --Eric
> >>
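The "one large aggregate feed" alternative Eric describes could be sketched as routing every event for a given feed to a fixed partition, so per-feed ordering is preserved and consumers can demultiplex. The crc32 keying below is an assumption for illustration, not Kafka's built-in partitioner:

```python
import zlib

# Hypothetical feed -> partition mapping for a single aggregate topic.
# Events for the same feed always land in the same partition, so one
# consumer per partition sees a disjoint, ordered subset of feeds.

def partition_for_feed(feed_id, num_partitions):
    # crc32 is stable across runs, unlike Python's randomized str hash.
    return zlib.crc32(feed_id.encode("utf-8")) % num_partitions

print(partition_for_feed("feed-42", 16))
```

With a zipfian distribution, a pure hash like this can leave one partition dominated by a single hot feed, which is part of the trade-off against Kafka's per-topic separation.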
