kafka-users mailing list archives

From: Sameer Kumar <sam.kum.w...@gmail.com>
Subject: Re: Kafka 11 | Stream Application crashed the brokers
Date: Mon, 11 Sep 2017 10:40:00 GMT
Hi Guozhang,

Please find the relevant logs attached; there is also a folder with the client
logs. Things started getting awry at 12:42:05.
Let me know if you need any more information.

-Sameer.

On Sun, Sep 10, 2017 at 5:06 PM, Sameer Kumar <sam.kum.work@gmail.com>
wrote:

> Hi Guozhang,
>
> Nope, I was not using exactly-once mode. I don't have the client logs with
> me right now; I will try to replicate the issue again and share the other
> details with you.
>
> My concern was that it crashed my brokers as well.
>
> -Sameer.
>
> On Sat, Sep 9, 2017 at 1:51 AM, Guozhang Wang <wangguoz@gmail.com> wrote:
>
>> Hello Sameer,
>>
>> I looked through your code, and here is what I figured: in the 0.11 version
>> we added the exactly-once feature
>> (https://cwiki.apache.org/confluence/display/KAFKA/KIP-98+-+Exactly+Once+Delivery+and+Transactional+Messaging),
>> which uses the transaction log (an internal topic named
>> "__transaction_state") that has a default replication factor of 3 (which
>> overrides your global config value of 2). Then at around 12:30, the leader
>> of the transaction log partitions kicked the replicas on both 190 and 192
>> out of the ISR:
>>
>> [2017-09-05 12:30:31,256] INFO [GroupCoordinator 193]: Preparing to rebalance group KafkaCache_TEST15 with old generation 14 (__consumer_offsets-27) (kafka.coordinator.group.GroupCoordinator)
>>
>> [2017-09-05 12:30:41,510] INFO Partition [__transaction_state,9] on broker 193: Shrinking ISR from 193,192,190 to 193 (kafka.cluster.Partition)
>>
>> [2017-09-05 12:30:41,513] INFO Partition [__transaction_state,6] on broker 193: Shrinking ISR from 193,190,192 to 193 (kafka.cluster.Partition)
>>
>> [2017-09-05 12:30:41,514] INFO Partition [__transaction_state,3] on broker 193: Shrinking ISR from 193,192,190 to 193 (kafka.cluster.Partition)
>>
>> [2017-09-05 12:30:41,515] INFO Partition [__transaction_state,18] on broker 193: Shrinking ISR from 193,190,192 to 193 (kafka.cluster.Partition)
>>
>> [2017-09-05 12:30:41,515] INFO Partition [__transaction_state,15] on broker 193: Shrinking ISR from 193,192,190 to 193 (kafka.cluster.Partition)
>>
>> [2017-09-05 12:30:41,516] INFO Partition [__transaction_state,12] on broker 193: Shrinking ISR from 193,190,192 to 193 (kafka.cluster.Partition)
>>
>> [2017-09-05 12:30:41,516] INFO Partition [__consumer_offsets,12] on broker 193: Shrinking ISR from 193,192,190 to 193 (kafka.cluster.Partition)
>>
>> [2017-09-05 12:30:41,517] INFO Partition [__consumer_offsets,15] on broker 193: Shrinking ISR from 193,192,190 to 193 (kafka.cluster.Partition)
>>
>> [2017-09-05 12:30:41,517] INFO Partition [__transaction_state,24] on broker 193: Shrinking ISR from 193,190,192 to 193 (kafka.cluster.Partition)
>>
>> In the meantime, the replica fetchers on both 190 and 192 seem to have
>> timed out on their fetch requests (note the big timestamp gap in the logs):
>>
>> [2017-09-05 12:26:21,130] INFO Rolled new log segment for 'AdServe-4' in 1 ms. (kafka.log.Log)
>> [2017-09-05 12:30:59,046] WARN [ReplicaFetcherThread-2-193]: Error in fetch to broker 193, request (type=FetchRequest, replicaId=190, maxWait=500, minBytes=1, maxBytes=10485760, fetchData={__consumer_offsets-21=(offset=0, logStartOffset=0, maxBytes=1048576)
>>
>> ...
>>
>> [2017-09-05 12:28:37,514] INFO Deleting index /data1/kafka/AdServe-5/00000000000405000294.timeindex.deleted (kafka.log.TimeIndex)
>> [2017-09-05 12:30:59,042] WARN [ReplicaFetcherThread-2-193]: Error in fetch to broker 193, request (type=FetchRequest, replicaId=192, maxWait=500, minBytes=1, maxBytes=10485760, fetchData={__consumer_offsets-21=(offset=0, logStartOffset=0, maxBytes=1048576)
>>
>> This caused the NotEnoughReplicasException, since appends to the transaction
>> log require acks=all and a min ISR equal to the number of replicas:
>>
>> [2017-09-05 12:32:11,612] ERROR [Replica Manager on Broker 193]: Error processing append operation on partition __transaction_state-18 (kafka.server.ReplicaManager)
>>
>> org.apache.kafka.common.errors.NotEnoughReplicasException: Number of insync replicas for partition __transaction_state-18 is [1], below required minimum [3]
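>>
>> For reference, the broker-side settings that govern this internal topic are
>> transaction.state.log.replication.factor and transaction.state.log.min.isr.
>> A minimal sketch of a server.properties excerpt follows; the values are
>> purely illustrative, not a recommendation for your cluster:
>>
>>   # server.properties (illustrative values only)
>>   transaction.state.log.replication.factor=3
>>   transaction.state.log.min.isr=2
>>
>> These are separate from the replication factor you configured for your own
>> topics, which is why the transaction log ended up with 3 replicas on a
>> cluster otherwise running with 2.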
>>
>> Upon seeing this error, the transaction coordinator should retry the append,
>> but if the retry never succeeds it will stay blocked. I did not see the
>> Streams API client-side logs, so I cannot tell for sure why this caused the
>> Streams app to fail as well. A quick question: did you enable exactly-once
>> processing (processing.guarantee=exactly_once) on your streams app?
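>>
>> For completeness, a minimal sketch of how that setting appears in a Streams
>> configuration; the application id and bootstrap servers below are just
>> placeholders:
>>
>>   # streams app configuration (placeholder names, for illustration only)
>>   application.id=my-streams-app
>>   bootstrap.servers=broker1:9092
>>   # default is at_least_once; exactly_once enables the transactional path
>>   processing.guarantee=exactly_once
>>
>> If exactly-once is not enabled, the Streams app itself should not be
>> producing transactional writes to __transaction_state, which is why I ask.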
>>
>>
>> Guozhang
>>
>>
>>
>>
>> On Fri, Sep 8, 2017 at 1:34 AM, Sameer Kumar <sam.kum.work@gmail.com>
>> wrote:
>>
>> > Hi All,
>> >
>> >
>> > Any thoughts on the mail below?
>> >
>> > -Sameer.
>> >
>> > On Wed, Sep 6, 2017 at 12:28 PM, Sameer Kumar <sam.kum.work@gmail.com>
>> > wrote:
>> >
>> > > Hi All,
>> > >
>> > > I want to report a scenario wherein running 2 different instances of my
>> > > stream application crashed my brokers and, eventually, my stream
>> > > application as well. This scenario only happens when my brokers run on
>> > > Kafka 11; everything works fine if my brokers are on Kafka 10.2 and the
>> > > stream application is on Kafka 11.
>> > >
>> > > I am attaching herewith the logs in a zipped format.
>> > >
>> > > The cluster configuration
>> > > 3 nodes (190, 192, 193), Kafka 11
>> > > Topic Replication Factor - 2
>> > >
>> > > App configuration
>> > > Kafka 11 streams.
>> > >
>> > >
>> > > The error I saw on the 193 server was
>> > > org.apache.kafka.common.errors.NotEnoughReplicasException: Number of
>> > > insync replicas for partition __transaction_state-18 is [1], below
>> > > required minimum [3]. Both the 192 and 190 servers reported errors about
>> > > failing to read from 193.
>> > >
>> > > Please look for the time around 12:30-12:32 to find the relevant logs.
>> > > Let me know if you need any other information.
>> > >
>> > >
>> > > Regards,
>> > > -Sameer.
>> > >
>> >
>>
>>
>>
>> --
>> -- Guozhang
>>
>
>
