kafka-users mailing list archives

From Sophie Blee-Goldman <sop...@confluent.io>
Subject Re: Kafka Streams application stuck rebalancing on startup
Date Wed, 28 Oct 2020 17:45:06 GMT
Yeah, there's definitely something weird going on (assuming this is the full
logs over that time period). The last thing we see logged from the
StreamThread is this message from around the start of the task assignment
process:

2020-10-28 12:22:37.879 DEBUG 27226 --- [-StreamThread-1]
o.a.k.s.p.i.StreamsPartitionAssignor     : stream-thread
[demo-application-81060bdc-c8cc-4350-85f8-d238267e264e-StreamThread-1-consumer]
Constructed client metadata
{81060bdc-c8cc-4350-85f8-d238267e264e=ClientMetadata{hostInfo=null,
consumers=[demo-application-81060bdc-c8cc-4350-85f8-d238267e264e-StreamThread-1-consumer-976853d9-06ad-4515-abf3-2a7398c12006],
state=[activeTasks: ([]) standbyTasks: ([]) assignedTasks: ([])
prevActiveTasks: ([]) prevStandbyTasks: ([]) prevAssignedTasks: ([])
prevOwnedPartitionsByConsumerId: ([]) capacity: 1]}} from the member
subscriptions.



which is at 12:22:37. Then there's nothing else from Streams until at least
12:25:00, where the logs end. Not sure what it could be doing inside the
assignor for 2+ minutes without ever reaching another log line... How many
partitions are on the input topics? Are you sure the input topics have been
pre-created before starting the app, with the correct names, etc.?
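
Since one open question is whether the input topics exist with the right names and partition counts, here is a minimal sketch of how that could be checked with the Kafka AdminClient before starting the app. The topic names and bootstrap address below are placeholders, not taken from the demo project, and this needs a running broker to execute:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;

public class InputTopicCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Placeholder bootstrap address; point this at the broker under test.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            // Placeholder names; substitute the app's actual input topics.
            List<String> inputTopics = Arrays.asList("input-topic-a", "input-topic-b");
            // describeTopics(...).all() fails with UnknownTopicOrPartitionException
            // if any of the listed topics does not exist.
            Map<String, TopicDescription> topics =
                    admin.describeTopics(inputTopics).all().get();
            topics.forEach((name, description) -> System.out.println(
                    name + ": " + description.partitions().size() + " partitions"));
        }
    }
}
```

If this fails for any of the input topics, the app would be subscribing to topics the broker doesn't know about, which is worth ruling out before digging into the assignor itself.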

On Wed, Oct 28, 2020 at 10:29 AM Alex Jablonski <ajablonski@thoughtworks.com>
wrote:

> Hi Sophie,
>
> Thanks for your questions! Responses inline below. Also, I realized I
> linked to the gradle file, not the interesting bits of the example. This
> <https://github.com/ajablonski/streams-issue-demo/blob/master/src/main/java/com/github/ajablonski/StreamsConfiguration.java>
> is the configuration and this
> <https://github.com/ajablonski/streams-issue-demo/blob/master/src/test/java/com/github/ajablonski/StreamsConfigurationTest.java>
> is the test.
>
> On Tue, Oct 27, 2020 at 10:11 PM Sophie Blee-Goldman <sophie@confluent.io>
> wrote:
>
>> >
>> > We've been able to get the crucial factors that cause this behavior
>> > down to a particular combination
>>
>> What do you mean by this -- that you only see this when all four of those
>> operators are at play? Or do you see it with any of them?
>>
>
> We see this when all four operators are in play. If you change the sample
> streams configuration to skip that final foreign-key join, or to not use
> custom serdes, for example, I don't see the stuck-state issue (the
> application transitions to the RUNNING state just fine).
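>
> For reference, that combination (a grouped, windowed aggregation with
> explicit serdes, in a topology that also contains a foreign-key join) has
> roughly this shape. This is a hedged sketch with made-up topic names,
> String types, and a count() standing in for the aggregate; it is not the
> demo project's actual code:
>
> ```java
> import java.time.Duration;
> import org.apache.kafka.common.serialization.Serdes;
> import org.apache.kafka.streams.StreamsBuilder;
> import org.apache.kafka.streams.Topology;
> import org.apache.kafka.streams.kstream.Grouped;
> import org.apache.kafka.streams.kstream.KTable;
> import org.apache.kafka.streams.kstream.Materialized;
> import org.apache.kafka.streams.kstream.TimeWindows;
> import org.apache.kafka.streams.kstream.Windowed;
>
> public class TopologySketch {
>     public static Topology build() {
>         StreamsBuilder builder = new StreamsBuilder();
>
>         // (1) group, (2) window, (3) aggregate, with explicit serdes.
>         KTable<Windowed<String>, Long> counts = builder
>                 .<String, String>stream("events")
>                 .groupBy((k, v) -> v, Grouped.with(Serdes.String(), Serdes.String()))
>                 .windowedBy(TimeWindows.of(Duration.ofMinutes(5)))
>                 .count(Materialized.with(Serdes.String(), Serdes.Long()));
>
>         // (4) foreign-key join (KIP-213, available since 2.4): the second
>         // argument extracts the right-hand table's key from the left value.
>         KTable<String, String> left = builder.table("left-table");
>         KTable<String, String> right = builder.table("right-table");
>         left.join(right, leftValue -> leftValue, (l, r) -> l + "|" + r);
>
>         return builder.build();
>     }
> }
> ```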
>
>
>>
>> I guess the first thing to narrow down is whether it's actually
>> rebalancing or just restoring within this time (the REBALANCING state is
>> somewhat misleadingly named). If this is a completely new app then it's
>> probably not restoring, but if this app had already been running and
>> building up state before hitting this issue, then that's probably the
>> reason. It's not at all uncommon for restoration to take more than 30
>> seconds.
>>
>
> This is happening with the app in a completely new state -- in the test,
> for example, there's no pre-loaded data when we're asserting that the app
> should eventually get to RUNNING, and none of the internal topics exist.
>
>
>> If it really is rebalancing this entire time, then you need to look into
>> the logs to figure out why. I don't see anything obviously wrong with your
>> particular application, and even if there was, it should never result in
>> endless rebalances like this. How many instances of the application are
>> you running?
>>
>
> In our actual application, we have 3 instances, but in the tests in that
> sample project, there's only 1.
>
> The logs that we're getting right before the application gets "stuck" are
> below. The one that seems most concerning to my uninformed eyes is "Member
> demo-application-714a21af-5fe5-4b9c-8450-53033309a406-StreamThread-1-consumer-cabbd9ce-83a7-4691-8599-b2ffe77da282
> in group demo-application has failed". I've attached some DEBUG-level logs
> too, though nothing in them stood out to me as explaining the hanging
> behavior.
>
> 2020-10-28 12:11:19.823  INFO 27127 --- [-StreamThread-1]
> o.a.k.c.c.internals.AbstractCoordinator  : [Consumer
> clientId=demo-application-714a21af-5fe5-4b9c-8450-53033309a406-StreamThread-1-consumer,
> groupId=demo-application] Discovered group coordinator localhost:50343 (id:
> 2147483647 rack: null)
> 2020-10-28 12:11:19.825  INFO 27127 --- [-StreamThread-1]
> o.a.k.c.c.internals.AbstractCoordinator  : [Consumer
> clientId=demo-application-714a21af-5fe5-4b9c-8450-53033309a406-StreamThread-1-consumer,
> groupId=demo-application] (Re-)joining group
> 2020-10-28 12:11:19.842  WARN 27127 --- [-StreamThread-1]
> org.apache.kafka.clients.NetworkClient   : [Consumer
> clientId=demo-application-714a21af-5fe5-4b9c-8450-53033309a406-StreamThread-1-consumer,
> groupId=demo-application] Error while fetching metadata with correlation id
> 7 :
> {demo-application-KTABLE-FK-JOIN-SUBSCRIPTION-RESPONSE-0000000018-topic=UNKNOWN_TOPIC_OR_PARTITION,
> demo-application-GroupName-repartition=UNKNOWN_TOPIC_OR_PARTITION,
> demo-application-KTABLE-FK-JOIN-SUBSCRIPTION-REGISTRATION-0000000010-topic=UNKNOWN_TOPIC_OR_PARTITION}
> 2020-10-28 12:11:19.860  INFO 27127 --- [-StreamThread-1]
> o.a.k.c.c.internals.AbstractCoordinator  : [Consumer
> clientId=demo-application-714a21af-5fe5-4b9c-8450-53033309a406-StreamThread-1-consumer,
> groupId=demo-application] Join group failed with
> org.apache.kafka.common.errors.MemberIdRequiredException: The group member
> needs to have a valid member id before actually entering a consumer group
> 2020-10-28 12:11:19.861  INFO 27127 --- [-StreamThread-1]
> o.a.k.c.c.internals.AbstractCoordinator  : [Consumer
> clientId=demo-application-714a21af-5fe5-4b9c-8450-53033309a406-StreamThread-1-consumer,
> groupId=demo-application] (Re-)joining group
> 2020-10-28 12:11:19.873  INFO 27127 --- [quest-handler-3]
> k.coordinator.group.GroupCoordinator     : [GroupCoordinator 0]: Preparing
> to rebalance group demo-application in state PreparingRebalance with old
> generation 0 (__consumer_offsets-2) (reason: Adding new member
> demo-application-714a21af-5fe5-4b9c-8450-53033309a406-StreamThread-1-consumer-cabbd9ce-83a7-4691-8599-b2ffe77da282
> with group instance id None)
> 2020-10-28 12:11:19.882  INFO 27127 --- [cutor-Rebalance]
> k.coordinator.group.GroupCoordinator     : [GroupCoordinator 0]: Stabilized
> group demo-application generation 1 (__consumer_offsets-2)
> 2020-10-28 12:11:19.947  WARN 27127 --- [-StreamThread-1]
> org.apache.kafka.clients.NetworkClient   : [Consumer
> clientId=demo-application-714a21af-5fe5-4b9c-8450-53033309a406-StreamThread-1-consumer,
> groupId=demo-application] Error while fetching metadata with correlation id
> 9 :
> {demo-application-KTABLE-FK-JOIN-SUBSCRIPTION-RESPONSE-0000000018-topic=UNKNOWN_TOPIC_OR_PARTITION,
> demo-application-GroupName-repartition=UNKNOWN_TOPIC_OR_PARTITION,
> demo-application-KTABLE-FK-JOIN-SUBSCRIPTION-REGISTRATION-0000000010-topic=UNKNOWN_TOPIC_OR_PARTITION}
> 2020-10-28 12:11:23.462  INFO 27127 --- [er-event-thread]
> kafka.controller.KafkaController         : [Controller id=0] Processing
> automatic preferred replica leader election
> 2020-10-28 12:11:29.887  INFO 27127 --- [cutor-Heartbeat]
> k.coordinator.group.GroupCoordinator     : [GroupCoordinator 0]: Member
> demo-application-714a21af-5fe5-4b9c-8450-53033309a406-StreamThread-1-consumer-cabbd9ce-83a7-4691-8599-b2ffe77da282
> in group demo-application has failed, removing it from the group
> 2020-10-28 12:11:29.887  INFO 27127 --- [cutor-Heartbeat]
> k.coordinator.group.GroupCoordinator     : [GroupCoordinator 0]: Preparing
> to rebalance group demo-application in state PreparingRebalance with old
> generation 1 (__consumer_offsets-2) (reason: removing member
> demo-application-714a21af-5fe5-4b9c-8450-53033309a406-StreamThread-1-consumer-cabbd9ce-83a7-4691-8599-b2ffe77da282
> on heartbeat expiration)
> 2020-10-28 12:11:29.888  INFO 27127 --- [cutor-Heartbeat]
> k.coordinator.group.GroupCoordinator     : [GroupCoordinator 0]: Group
> demo-application with generation 2 is now empty (__consumer_offsets-2)
>
>
>>
>> Cheers,
>> Sophie
>>
>> On Thu, Oct 15, 2020 at 10:01 PM Alex Jablonski <
>> ajablonski@thoughtworks.com>
>> wrote:
>>
>> > Hey there!
>> >
>> > My team and I have run across a bit of a jam in our application where,
>> > given a particular setup, our Kafka Streams application never seems to
>> > start successfully, instead just getting stuck in the REBALANCING state.
>> > We've been able to get the crucial factors that cause this behavior down
>> > to a particular combination of (1) grouping, (2) windowing, (3)
>> > aggregating, and (4) foreign-key joining, with some of those steps
>> > specifying Serdes besides the default.
>> >
>> > It's probably more useful to see a minimal example, so there's one here
>> > <https://github.com/ajablonski/streams-issue-demo/blob/master/build.gradle>.
>> > The underlying Kafka Streams version is 2.5.1. The first test should show
>> > the application eventually transitioning to the RUNNING state, but it
>> > doesn't within the 30-second timeout I've set. Interestingly, getting rid
>> > of the 'Grouped.with' argument to the 'groupBy' function and the
>> > 'Materialized.with' in 'aggregate' in the 'StreamsConfiguration' lets the
>> > application transition to "RUNNING", though without the correct Serdes
>> > that's not too valuable.
>> >
>> > There might be a cleaner way to organize the particular flow in the toy
>> > example, but is there something fundamentally wrong with the approach
>> > laid out in that application that would cause Streams to be stuck in
>> > REBALANCING? I'd appreciate any advice folks could give!
>> >
>> > Thanks!
>> > Alex Jablonski
>> >
>>
>
