kafka-users mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Shimi Kiviti <shim...@gmail.com>
Subject Re: Kafka-streams process stopped processing messages
Date Thu, 04 May 2017 14:45:34 GMT
Thanks Eno,

We still see problems on our side.
when we run kafka-streams 0.10.1.1 eventually the problem goes away but
with 0.10.2.1 it is not.
We see a lot of the rebalancing messages I wrote before

on at least 1 kafka-stream nodes we see disconnection messages like the
following. These messages repeat all the time

2017-05-04 14:25:56,063 [StreamThread-1] INFO
o.a.k.c.c.i.AbstractCoordinator: Discovered coordinator
ip-10-0-91-10.ec2.internal:9092 (id: 2147483646 rack: null) for group sa.
2017-05-04 14:25:56,063 [StreamThread-1] DEBUG o.a.k.c.NetworkClient:
Initiating connection to node 2147483646 at ip-10-0-91-10.ec2.internal:9092.
2017-05-04 14:25:56,091 [StreamThread-1] INFO
o.a.k.c.c.i.AbstractCoordinator: (Re-)joining group sa
2017-05-04 14:25:56,093 [StreamThread-1] DEBUG
o.a.k.s.p.i.StreamPartitionAssignor: stream-thread [StreamThread-1] found
[sa-events] topics possibly matching regex
2017-05-04 14:25:56,096 [StreamThread-1] DEBUG o.a.k.s.p.TopologyBuilder:
stream-thread [StreamThread-1] updating builder with
SubscriptionUpdates{updatedTopicSubscriptions=[sa-events]} topic(s) with po
ssible matching regex subscription(s)
2017-05-04 14:25:56,096 [StreamThread-1] DEBUG
o.a.k.c.c.i.AbstractCoordinator: Sending JoinGroup ((type:
JoinGroupRequest, groupId=sa, sessionTimeout=10000,
rebalanceTimeout=2147483647, memb
erId=, protocolType=consumer,
groupProtocols=org.apache.kafka.common.requests.JoinGroupRequest$ProtocolMetadata@2f894d9b))
to coordinator ip-10-0-91-10.ec2.internal:9092 (id: 2147483646 rack: null)
2017-05-04 14:25:56,097 [StreamThread-1] DEBUG o.a.k.c.n.Selector: Created
socket with SO_RCVBUF = 1048576, SO_SNDBUF = 131072, SO_TIMEOUT = 0 to node
2147483646
2017-05-04 14:25:56,097 [StreamThread-1] DEBUG o.a.k.c.NetworkClient:
Completed connection to node 2147483646.  Fetching API versions.
2017-05-04 14:25:56,097 [StreamThread-1] DEBUG o.a.k.c.NetworkClient:
Initiating API versions fetch from node 2147483646.
2017-05-04 14:25:56,104 [StreamThread-1] DEBUG o.a.k.c.NetworkClient:
Recorded API versions for node 2147483646: (Produce(0): 0 to 2 [usable: 2],
Fetch(1): 0 to 3 [usable: 3], Offsets(2): 0 to 1 [usable: 1],
 Metadata(3): 0 to 2 [usable: 2], LeaderAndIsr(4): 0 [usable: 0],
StopReplica(5): 0 [usable: 0], UpdateMetadata(6): 0 to 2 [usable: 2],
ControlledShutdown(7): 1 [usable: 1], OffsetCommit(8): 0 to 2 [usable:
2], OffsetFetch(9): 0 to 1 [usable: 1], GroupCoordinator(10): 0 [usable:
0], JoinGroup(11): 0 to 1 [usable: 1], Heartbeat(12): 0 [usable: 0],
LeaveGroup(13): 0 [usable: 0], SyncGroup(14): 0 [usable: 0], Desc
ribeGroups(15): 0 [usable: 0], ListGroups(16): 0 [usable: 0],
SaslHandshake(17): 0 [usable: 0], ApiVersions(18): 0 [usable: 0],
CreateTopics(19): 0 [usable: 0], DeleteTopics(20): 0 [usable: 0])
2017-05-04 14:29:44,800 [kafka-producer-network-thread |
sa-5788b5a5-aadc-4276-916f-1640008c17da-StreamThread-1-producer] DEBUG
o.a.k.c.NetworkClient: Node -2 disconnected.
2017-05-04 14:29:44,801 [kafka-producer-network-thread |
sa-5788b5a5-aadc-4276-916f-1640008c17da-StreamThread-1-producer] DEBUG
o.a.k.c.NetworkClient: Sending metadata request (type=MetadataR
equest, topics=) to node 1
2017-05-04 14:29:44,801 [kafka-producer-network-thread |
sa-5788b5a5-aadc-4276-916f-1640008c17da-StreamThread-1-producer] DEBUG
o.a.k.c.NetworkClient: Node -1 disconnected.
2017-05-04 14:29:44,802 [kafka-producer-network-thread |
sa-5788b5a5-aadc-4276-916f-1640008c17da-StreamThread-1-producer] DEBUG
o.a.k.c.Metadata: Updated cluster metadata version 4 to Cluster
(id = JsVqjH3tS4CIcqpd2jkogA, nodes = [ip-10-0-91-10.ec2.internal:9092 (id:
1 rack: null), ip-10-0-95-250.ec2.internal:9092 (id: 2 rack: null)],
partitions = [])
2017-05-04 14:30:56,062 [StreamThread-1] DEBUG o.a.k.c.NetworkClient:
Sending metadata request (type=MetadataRequest, topics=<ALL>) to node 2
2017-05-04 14:30:56,073 [StreamThread-1] DEBUG o.a.k.c.Metadata: Updated
cluster metadata version 7 to Cluster(id = JsVqjH3tS4CIcqpd2jkogA, nodes =
[ip-10-0-95-250.ec2.internal:9092 (id: 2 rack: null), ip-10
-0-91-10.ec2.internal:9092 (id: 1 rack: null)], partitions =
[Partition(topic = sa-events, partition = 0, leader = 1, replicas = [1,2],
isr = [2,1]), Partition(topic = sa-events, partition = 1, lea
der = 2, replicas = [1,2], isr = [2,1]), Partition(topic = sa-events,
partition = 2, leader = 1, replicas = [1,2], isr = [2,1])])
2017-05-04 14:31:06,085 [StreamThread-1] DEBUG o.a.k.c.NetworkClient:
Disconnecting from node 2147483646 due to request timeout.
2017-05-04 14:31:06,086 [StreamThread-1] DEBUG
o.a.k.c.c.i.ConsumerNetworkClient: Cancelled JOIN_GROUP request
{api_key=11,api_version=1,correlation_id=16,client_id=sa-5788b5a5-aadc-4276-916f
-1640008c17da-StreamThread-1-consumer} with correlation id 16 due to node
2147483646 being disconnected
2017-05-04 14:31:06,086 [StreamThread-1] INFO
o.a.k.c.c.i.AbstractCoordinator: Marking the coordinator
ip-10-0-91-10.ec2.internal:9092 (id: 2147483646 rack: null) dead for group
sa
2017-05-04 14:31:06,195 [StreamThread-1] DEBUG
o.a.k.c.c.i.AbstractCoordinator: Sending GroupCoordinator request for group
sa to broker ip-10-0-91-10.ec2.internal:9092 (id: 1 rack: null)
2017-05-04 14:31:06,196 [StreamThread-1] DEBUG o.a.k.c.NetworkClient:
Sending metadata request (type=MetadataRequest, topics=<ALL>) to node 2
2017-05-04 14:31:06,200 [StreamThread-1] DEBUG
o.a.k.c.c.i.AbstractCoordinator: Received GroupCoordinator response
ClientResponse(receivedTimeMs=1493908266200, latencyMs=5,
disconnected=false, requestHeader=
{api_key=10,api_version=0,correlation_id=19,client_id=sa-5788b5a5-aadc-4276-916f-1640008c17da-StreamThread-1-consumer},
responseBody={error_code=0,coordinator={node_id=1,host=ip-10-0-91-10.ec
2.internal,port=9092}}) for group sa


On Mon, May 1, 2017 at 4:19 PM, Eno Thereska <eno.thereska@gmail.com> wrote:

> Hi Shimi,
>
> 0.10.2.1 contains a number of fixes that should make the out of box
> experience better, including resiliency under broker failures and better
> exception handling. If you ever get back to it, and if the problem happens
> again, please do send us the logs and we'll happily have a look.
>
> Thanks
> Eno
> > On 1 May 2017, at 12:05, Shimi Kiviti <shimi.k@gmail.com> wrote:
> >
> > Hi Eno,
> > I am afraid I played too much with the configuration to make this
> > productive investigation :(
> >
> > This is a QA environment which includes 2 kafka instances and 3 zookeeper
> > instances in AWS. There are only 3 partition for this topic.
> > Kafka broker and kafka-stream are version 0.10.1.1
> > Our kafka-stream app run on docker using kubernetes.
> > I played around with with 1 to 3  kafka-stream processes, but I got the
> > same results. It is too easy to scale with kubernetes :)
> > Since there are only 3 partitions, I didn't start more then 3 instances.
> >
> > I was too quick to upgraded only the kafka-stream app to 0.10.2.1 with
> hope
> > that it will solve the problem, It didn't.
> > The log I sent before are from this version.
> >
> > I did notice "unknown" offset for the main topic with kafka-stream
> version
> > 0.10.2.1
> > $ ./bin/kafka-consumer-groups.sh   --bootstrap-server localhost:9092
> > --describe --group sa
> > GROUP                          TOPIC                          PARTITION
> > CURRENT-OFFSET  LOG-END-OFFSET  LAG             OWNER
> > sa             sa-events                 0          842199
> > 842199          0
> > sa-4557bf2d-ba79-42a6-aa05-5b4c9013c022-StreamThread-1-consumer_/
> 10.0.10.9
> > sa             sa-events                 1          1078428
> > 1078428         0
> > sa-4557bf2d-ba79-42a6-aa05-5b4c9013c022-StreamThread-1-consumer_/
> 10.0.10.9
> > sa             sa-events                 2          unknown
> > 26093910        unknown
> > sa-4557bf2d-ba79-42a6-aa05-5b4c9013c022-StreamThread-1-consumer_/
> 10.0.10.9
> >
> > After that I downgraded the kafka-stream app back to version 0.10.1.1
> > After a LONG startup time (more than an hour) where the status of the
> group
> > was rebalancing, all the 3 processes started processing messages again.
> >
> > This all thing started after we hit a bug in our code (NPE) that crashed
> > the stream processing thread.
> > So now after 4 days, everything is back to normal.
> > This worries me since it can happen again
> >
> >
> > On Mon, May 1, 2017 at 11:45 AM, Eno Thereska <eno.thereska@gmail.com>
> > wrote:
> >
> >> Hi Shimi,
> >>
> >> Could you provide more info on your setup? How many kafka streams
> >> processes do you have and from how many partitions are they consuming
> from.
> >> If you have more processes than partitions some of the processes will be
> >> idle and won’t do anything.
> >>
> >> Eno
> >>> On Apr 30, 2017, at 5:58 PM, Shimi Kiviti <shimi.k@gmail.com> wrote:
> >>>
> >>> Hi Everyone,
> >>>
> >>> I have a problem and I hope one of you can help me figuring it out.
> >>> One of our kafka-streams processes stopped processing messages
> >>>
> >>> When I turn on debug log I see lots of these messages:
> >>>
> >>> 2017-04-30 15:42:20,228 [StreamThread-1] DEBUG o.a.k.c.c.i.Fetcher:
> >> Sending
> >>> fetch for partitions [devlast-changelog-2] to broker ip-x-x-x-x
> >>> .ec2.internal:9092 (id: 1 rack: null)
> >>> 2017-04-30 15:42:20,696 [StreamThread-1] DEBUG o.a.k.c.c.i.Fetcher:
> >>> Ignoring fetched records for devlast-changelog-2 at offset 2962649
> since
> >>> the current position is 2963379
> >>>
> >>> After a LONG time, the only messages in the log are these:
> >>>
> >>> 2017-04-30 16:46:33,324 [kafka-coordinator-heartbeat-thread | sa]
> DEBUG
> >>> o.a.k.c.c.i.AbstractCoordinator: Sending Heartbeat request for group
> sa
> >> to
> >>> coordinator ip-x-x-x-x.ec2.internal:9092 (id: 2147483646 rack: null)
> >>> 2017-04-30 16:46:33,425 [kafka-coordinator-heartbeat-thread | sa]
> DEBUG
> >>> o.a.k.c.c.i.AbstractCoordinator: Received successful Heartbeat
> response
> >> for
> >>> group same
> >>>
> >>> Any idea?
> >>>
> >>> Thanks,
> >>> Shimi
> >>
> >>
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message