kafka-users mailing list archives

From Jun Rao <jun...@gmail.com>
Subject Re: Race when broker reconnects to zk?
Date Sat, 24 Dec 2011 06:39:34 GMT
Dan,

The only case in which we try to create a new ephemeral node in ZK (broker id
and consumer id) is when the ZK session has expired. When a session
expires, all ephemeral nodes created in that session should already have been
deleted automatically by the ZK server. So, this shouldn't cause node conflicts.
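
To make the expiry argument concrete, here is a toy in-memory model. It is only a sketch, not real ZooKeeper or Kafka code: FakeZk, its methods, the session names and the "host:9092" payload are all made up for illustration. It shows that once a session is expired, the ephemeral nodes it owned are gone, so a create from the new session does not hit a conflict.

```python
class FakeZk:
    """Toy stand-in for a ZK server (hypothetical, not a real ZooKeeper API)."""

    def __init__(self):
        self.nodes = {}  # path -> (data, id of the session that owns the node)

    def create_ephemeral(self, session_id, path, data):
        if path in self.nodes:
            raise KeyError(f"node exists: {path}")
        self.nodes[path] = (data, session_id)

    def expire_session(self, session_id):
        # Expiring a session drops every ephemeral node that session created.
        self.nodes = {p: v for p, v in self.nodes.items() if v[1] != session_id}


zk = FakeZk()
zk.create_ephemeral("session-1", "/brokers/ids/0", "host:9092")  # made-up payload
zk.expire_session("session-1")   # old session expires, its node is removed
zk.create_ephemeral("session-2", "/brokers/ids/0", "host:9092")  # no conflict
owner = zk.nodes["/brokers/ids/0"][1]
```

In this model the node ends up owned by the new session, which is the behavior described above.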

The check in ZkUtils.createEphemeralPathExpectConflict is for the case when
the client tries to create an ephemeral node but gets an exception because
it was disconnected (while the session has not expired yet). When this
happens, it's not clear whether the creation actually succeeded or not. The
extra check verifies whether the creation by that client did in fact
succeed.
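
A rough sketch of that check, again against a toy in-memory store rather than the real ZkUtils or ZooKeeper client: FlakyZk, the exception classes, register and the sample path and data are hypothetical names chosen for illustration. The store applies the write and then simulates a lost reply, so the retry sees an existing node and has to compare the data to decide whether the earlier create was its own.

```python
class NodeExistsError(Exception):
    pass


class ConnectionLossError(Exception):
    pass


class FlakyZk:
    """Toy store that can lose the reply to a create that was applied."""

    def __init__(self):
        self.nodes = {}  # path -> data
        self.drop_next_reply = False

    def create_ephemeral(self, path, data):
        if path in self.nodes:
            raise NodeExistsError(path)
        self.nodes[path] = data
        if self.drop_next_reply:
            # The write went through, but the client never hears about it.
            self.drop_next_reply = False
            raise ConnectionLossError(path)

    def read(self, path):
        return self.nodes[path]


def create_ephemeral_expect_conflict(zk, path, data):
    try:
        zk.create_ephemeral(path, data)
    except NodeExistsError:
        # Same data: assume our earlier attempt succeeded. Otherwise re-raise.
        if zk.read(path) != data:
            raise


def register(zk, path, data):
    try:
        create_ephemeral_expect_conflict(zk, path, data)
    except ConnectionLossError:
        # Outcome unknown after the disconnect, so try once more.
        create_ephemeral_expect_conflict(zk, path, data)


zk = FlakyZk()
zk.drop_next_reply = True
register(zk, "/brokers/ids/0", "host:9092")  # made-up path and payload
```

The comparison is what separates "my own create went through" from a genuine conflict with a node holding different data, which still raises.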

There were a couple of bugs in ZK 3.3.3 that could cause ephemeral nodes to
be lost. That could cause the problem you have seen. Those issues have been
fixed in ZK 3.3.4.

Thanks,

Jun

On Fri, Dec 23, 2011 at 12:48 PM, Dan Brown <dan@metamx.com> wrote:

> Hi all, I'm concerned that there's an unsafe race when a broker loses
> and reestablishes its zk connection, and I'd like others to weigh in.
>
> On ZookeeperConsumerConnector:204, registerConsumerInZK calls
> ZkUtils.createEphemeralPathExpectConflict, which on ZkUtils:89 has a
> case where it observes that the node and data it wants to create
> already exist, and it considers this a success and returns normally.
> But isn't it possible for that already created node to be a stale
> ephemeral node that is about to disappear, in which case the broker
> will lose its ephemeral /brokers/ids node and consumers won't be able
> to find it? In particular, wouldn't this occur when the broker gets
> disconnected from zk, reconnects with a new session, and tries to
> recreate its ephemeral node before zk has timed out the ephemeral node
> from its previous session?
>
> I'm seeing a behavior where one of our brokers was running but had no
> /brokers/ids node, and the logs indicated that it reconnected to zk
> recently, and I'm suspecting this as the explanation. (To fix it, I
> just restarted the broker.) I'm running an old RC for kafka-0.6, but
> looking at the latest code (from the git mirror) it looks like the
> code path described above is still the same as what we're running.
>
>  Dan
>
