kafka-users mailing list archives

From Adam Elliott <aelli...@salesforce.com>
Subject Reliable topic deletion in multi-tenant environment
Date Mon, 01 Oct 2018 21:47:54 GMT

My team runs a multi-tenant Kafka cluster with a lot of diverse uses, and
one of the services we provide is an API for managed topic
creation/deletion. The cluster is large (> 100 nodes) and so it's pretty
likely that, for whatever reason, at least one node will be down at any
given point--and sometimes for extended periods.

We're currently struggling with Kafka's behaviour when an under-replicated
topic is deleted. The topic is "marked for deletion", the partitions go
offline, but the deletion operation blocks until the missing node is
brought back online. From what I can see in the source, this is intentional.

We've had two problems with this:

   1. Certain customers want to be able to delete and re-create topics
   quickly. Until the pending deletions are resolved, they are stuck.
   2. If too many pending deletions are queued up (e.g. from bulk topic
   maintenance), something overflows. I haven't dug too much into this, but it
   ends up crashing the Controller.

#2 has caused us lots of pain in the past, but since we determine when bulk
maintenance happens, we can simply avoid starting when topics are
under-replicated.

#1 is more of a problem right now, and unfortunately, a solution like
making topic names unique isn't going to work for our customers.

One thing I've noticed is that the JSON-powered partition reassignment can
be used to avoid this, by reassigning partitions away from the down Broker
prior to executing the delete. This mechanism isn't suitable for on-demand
topic deletion, but I looked into the source. Partition reassignment
executes the delete state machine workflow for those partitions that need
to be removed, but in this case, it skips all of the various checks that
the normal delete workflow does about offline hosts.
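
For illustration, here is a minimal Python sketch of how the reassignment JSON for that workaround might be generated. It only builds the plan document in the format accepted by kafka-reassign-partitions.sh ({"version": 1, "partitions": [...]}); it does not talk to the cluster. The function name build_reassignment and its inputs (the current assignment map, the set of down Brokers, and the list of all Brokers) are hypothetical and would have to be filled in from cluster metadata. The replacement strategy (lowest-numbered live Broker not already in the replica set) is an assumption chosen for brevity, not a recommendation.

```python
import json


def build_reassignment(assignment, down_brokers, all_brokers):
    """Build a reassignment plan that moves replicas off down brokers.

    assignment:   {(topic, partition): [broker ids]}  (hypothetical input shape)
    down_brokers: iterable of broker ids that are currently offline
    all_brokers:  iterable of every broker id in the cluster
    """
    down = set(down_brokers)
    # Candidate replacements: every broker that is not down, lowest id first.
    spares = sorted(b for b in all_brokers if b not in down)

    partitions = []
    for (topic, partition), replicas in sorted(assignment.items()):
        if not down.intersection(replicas):
            continue  # partition has no replica on a down broker; leave it alone

        # Keep surviving replicas in their original order.
        new_replicas = [b for b in replicas if b not in down]
        for candidate in spares:
            if len(new_replicas) == len(replicas):
                break
            if candidate not in new_replicas:
                new_replicas.append(candidate)

        if len(new_replicas) < len(replicas):
            raise ValueError(
                "not enough live brokers for %s-%d" % (topic, partition)
            )

        partitions.append(
            {"topic": topic, "partition": partition, "replicas": new_replicas}
        )

    return {"version": 1, "partitions": partitions}


if __name__ == "__main__":
    # Example data only: broker 3 is down and hosts a replica of orders-0.
    plan = build_reassignment(
        {("orders", 0): [1, 2, 3], ("orders", 1): [2, 4, 5]},
        down_brokers={3},
        all_brokers=[1, 2, 3, 4, 5],
    )
    print(json.dumps(plan, indent=2))
```

The output would be saved to a file and passed to kafka-reassign-partitions.sh via --reassignment-json-file with --execute. Always picking the lowest-numbered spare will skew load on a real cluster, so a production version would need a smarter placement policy.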

Does anyone have any advice on forcing a topic deletion to complete, even
if it's under-replicated? If the offline-Broker checks are important, why
does the partition reassignment workflow skip them? What would be the
impact of making the standard topic deletion also skip the checks?


Adam Elliott
Software Engineer, Salesforce
