kafka-users mailing list archives

From Jay Kreps <jay.kr...@gmail.com>
Subject Re: Number of feeds, how does it scale?
Date Mon, 09 Apr 2012 20:55:51 GMT
Hey Eric,

I think the most topics we have on a single cluster at LinkedIn is around
300. Our usage is closer to the "one big topic, partitioned by some
relevant key" model. I am interested in working out any bugs with larger
numbers of topics, so if you hit anything we can probably help work
through it, but at least based on our own usage I can't guarantee you
won't see anything.
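
To make the "one big topic, partitioned by some relevant key" approach
concrete, here is a rough producer-side sketch. This is purely illustrative:
it uses the newer org.apache.kafka.clients Java producer API rather than the
producer API from this thread's era, and the broker address, topic name, and
feed id are all made up.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class FeedProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092"); // placeholder broker
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            // One aggregate topic; the feed id is the message key, so the
            // default hash-by-key partitioner keeps each feed's events on a
            // single partition.
            String feedId = "feed-1234";                              // hypothetical feed id
            String event = "{\"ts\": 1333999999, \"body\": \"...\"}"; // hypothetical payload
            producer.send(new ProducerRecord<>("all-feeds", feedId, event));
        }
    }
}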

Things to consider:
- You probably don't want every topic on every machine, so that no single
machine needs to host all 10k topics. The ZK producer should respect this,
but there isn't a convenient "create topic" command line tool to let you
set the number of machines that host each topic.
- There might be issues around the amount of ZK metadata, for example the
thing Taylor mentions where we process topics sequentially...
- It might be good to run it once with hprof or an equivalent profiler
enabled to make sure we haven't done anything stupid internally.
- On unclean shutdown we run recovery on the last segment of each log. So
if your log segment size is 100MB and you have 10k topics, that is roughly
1TB of recovery, which will be a bit slow. A quick hack fix is to just set
the segment size smaller. A better fix is for Kafka to periodically save
out safe recovery points. We had a patch floating around somewhere to do
this, but I don't think we've gotten it into trunk...
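
For completeness, here is a rough sketch of the consuming side of the
single-keyed-topic approach (same caveats as above: newer Java client API,
made-up names), where the "custom logic to keep the feeds separated" is just
a dispatch or filter on the message key:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class FeedConsumerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092");  // placeholder broker
        props.put("group.id", "feed-1234-readers");      // hypothetical group
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("all-feeds"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    // The "custom logic" is just a check on the message key.
                    if ("feed-1234".equals(record.key())) {
                        System.out.println(record.value());
                    }
                }
            }
        }
    }
}

The obvious trade-off is that every consumer reads the whole aggregate
stream and drops the records it doesn't care about, whereas with one topic
per feed each consumer only pulls its own data.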

-Jay

On Mon, Apr 9, 2012 at 11:04 AM, Eric Tschetter <echeddar@gmail.com> wrote:

> Hi guys,
>
> I'm wondering about experiences with a large number of feeds created
> and managed on a single Kafka cluster.  Specifically, if anyone can
> share information about how many different feeds they have on their
> Kafka cluster and the overall throughput, that'd be cool.
>
> Some background: I'm planning on setting up a system around Kafka that
> will (hopefully, eventually) have >10,000 feeds in parallel.  I expect
> event volume on these feeds to follow a Zipfian distribution.  So,
> there will be a long tail of smaller feeds and some large ones, but
> there will be consumers for each of these feeds.  I'm trying to decide
> between relying on Kafka's feeds to maintain the separation between
> the data streams and creating one large aggregate feed that uses
> Kafka's partitioning mechanisms along with some custom logic to keep
> the feeds separated.  I'd prefer to use Kafka's built-in feed
> mechanisms, because there are significant benefits to that, but I can
> also imagine that this many feeds was not in the base assumptions of
> how the system would be used and is thus questionable
> performance-wise.
>
> Any input is appreciated.
>
> --Eric
>
