kafka-users mailing list archives

From Ewen Cheslack-Postava <e...@confluent.io>
Subject Re: Kafka Connect gets into a rebalance loop
Date Sun, 18 Dec 2016 00:55:08 GMT
The message

> Wasn't unable to resume work after last rebalance

means that your previous iterations of the rebalance were somehow behind/out
of sync with other members of the group, i.e. they had not read up to the
same point in the config topic so it wouldn't be safe for this worker (or
possibly the entire cluster if this worker was the leader) to resume work.
(I think there's a typo in the log message; it should say "wasn't *able* to
resume work".)

This message indicates the problem:

> Catching up to assignment's config offset.

The leader was using configs that were newer than this member's, so it's not
safe for it to start its assigned work since it might be using outdated
configuration. When it tries to catch up, it continues trying to read up
until the end of the config topic, which should be at least as far as the
leader indicated its position was. (Another gap in logging: that message
should really include the offset it is trying to catch up to, although you
can also check that manually since it'll always be trying to read to the
end of the topic.)
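
Until the log message includes the target offset, one way to see the gap is
to pull both numbers out of the "Current config state offset ..." lines. The
following is only a rough sketch: the function names are made up for
illustration, and it assumes the lines look exactly like the ones quoted
later in this thread (with any leading "> " quoting already stripped).

```python
import re

# Matches the two herder log lines quoted in this thread, e.g.
# "Current config state offset 3 is behind group assignment 5".
OFFSET_LINE = re.compile(
    r"Current config state offset (\d+) "
    r"(?:is behind|does not match) group assignment (\d+)"
)


def config_offset_gaps(log_text):
    """Return (local_offset, assigned_offset) pairs found in log_text."""
    # The quoted log lines are hard-wrapped, so collapse all whitespace first.
    flat = " ".join(log_text.split())
    return [(int(local), int(assigned))
            for local, assigned in OFFSET_LINE.findall(flat)]


def is_stuck(pairs):
    """True if every observation repeats the same local offset behind the assignment."""
    return (len(pairs) > 1
            and len(set(pairs)) == 1
            and pairs[0][0] < pairs[0][1])
```

If is_stuck() is true across many rebalances, the worker never sees the
records the leader has already seen, which fits the catch-up timing out
rather than the worker merely being slow.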

This catch-up has a timeout which defaults to 3s (which is pretty
substantial given the rate at which configs tend to be written and their
size). The fact that your worker isn't able to catch up probably indicates
a connectivity issue, or possibly a misconfiguration in which the workers
join the same group in the same cluster, but one reads configs from one
cluster/config topic while another reads them from a different
cluster/config topic.
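
To rule out the misconfiguration case, it can help to compare the worker
properties side by side. A minimal sketch, assuming each worker's properties
have already been loaded into a dict (find_mismatches and the worker names
are hypothetical; group.id, bootstrap.servers and config.storage.topic are
the standard distributed worker settings). Note that it compares
bootstrap.servers as plain strings, so two different broker lists for the
same cluster would be flagged as well.

```python
# Settings that should agree across all workers sharing one group.id.
SHARED_KEYS = ("bootstrap.servers", "config.storage.topic")


def find_mismatches(workers):
    """workers maps a worker name to its properties dict.

    Returns {group_id: {key: set_of_values}} for every shared key that has
    more than one distinct value within the same group.
    """
    by_group = {}
    for props in workers.values():
        by_group.setdefault(props.get("group.id"), []).append(props)

    problems = {}
    for group, members in by_group.items():
        for key in SHARED_KEYS:
            values = {member.get(key) for member in members}
            if len(values) > 1:
                problems.setdefault(group, {})[key] = values
    return problems
```

An empty result does not prove the setup is right (it says nothing about
connectivity), but a non-empty one points straight at the kind of split
described above. As far as I know the catch-up timeout itself is the
worker.sync.timeout.ms setting, in case you want to rule out plain slowness.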

-Ewen

On Fri, Dec 16, 2016 at 3:16 AM, Frank Lyaruu <flyaruu@gmail.com> wrote:

> Hi people,
>
> I've just deployed my Kafka Streams / Connect (I only use a connect sink to
> mongodb) application on a cluster of four instances (4 containers on 2
> machines) and now it seems to get into a sort of rebalancing loop, and I
> don't get much in mongodb, I've got a little bit of data at the beginning,
> but no new data appears.
>
> The rest of the streams application seems to behave.
>
> This is what I get in my log, but at a pretty high speed (about 100 per
> second):
>
> Current config state offset 3 is behind group assignment 5, reading to end
> of config log
> Joined group and got assignment: Assignment{error=0,
> leader='connect-2-8fb3bfc4-93f2-4d08-82df-8e7c4b99ec13', leaderUrl='',
> offset=5, connectorIds=[KNVB-production-generation-99-person-mongosink],
> taskIds=[]}
> Successfully joined group NHV-production-generation-99-person-mongosink
> with generation 6
> Successfully joined group KNVB-production-generation-99-person-mongosink
> with generation 6
> Wasn't unable to resume work after last rebalance, can skip stopping
> connectors and tasks
> Rebalance started
> Wasn't unable to resume work after last rebalance, can skip stopping
> connectors and tasks
> (Re-)joining group KNVB-production-generation-99-person-mongosink
> Current config state offset 3 does not match group assignment 5. Forcing
> rebalance.
> Finished reading to end of log and updated config snapshot, new config log
> offset: 3
> Finished reading to end of log and updated config snapshot, new config log
> offset: 3
> Current config state offset 3 does not match group assignment 5. Forcing
> rebalance.
> Joined group and got assignment: Assignment{error=0,
> leader='connect-1-1893fd59-3ce8-4061-8131-ae36e58f5524', leaderUrl='',
> offset=5, connectorIds=[], taskIds=[]}
> Current config state offset 3 is behind group assignment 5, reading to end
> of config log
> Successfully joined group KNVB-production-generation-99-person-mongosink
> with generation 6
> (Re-)joining group KNVB-production-generation-99-person-mongosink
> Current config state offset 3 does not match group assignment 5. Forcing
> rebalance.
> Rebalance started
> Current config state offset 3 is behind group assignment 5, reading to end
> of config log
> Catching up to assignment's config offset.
> Successfully joined group NHV-production-generation-99-person-mongosink
> with generation 6
> Joined group and got assignment: Assignment{error=0,
> leader='connect-2-8fb3bfc4-93f2-4d08-82df-8e7c4b99ec13', leaderUrl='',
> offset=5, connectorIds=[], taskIds=[]}
> Catching up to assignment's config offset.
> Joined group and got assignment: Assignment{error=0,
> leader='connect-2-8fb3bfc4-93f2-4d08-82df-8e7c4b99ec13', leaderUrl='',
> offset=5, connectorIds=[], taskIds=[]}
> (Re-)joining group NHV-production-generation-99-person-mongosink
> Wasn't unable to resume work after last rebalance, can skip stopping
> connectors and tasks
> Successfully joined group NHV-production-generation-99-person-mongosink
> with generation 6
> Current config state offset 3 does not match group assignment 5. Forcing
> rebalance.
> Finished reading to end of log and updated config snapshot, new config log
> offset: 3
> Current config state offset 3 does not match group assignment 5. Forcing
> rebalance.
> Rebalance started
>
> ... and so on..
>
> Any ideas?
>
> regards, Frank
>
