kafka-users mailing list archives

From David Chu <david....@appdynamics.com>
Subject Re: Kafka Streams Application Failing to Start Due to State Store Recovery Time Exceeding Producer Transaction Timeout
Date Tue, 10 Jul 2018 18:18:30 GMT
Yes, https://issues.apache.org/jira/browse/KAFKA-6634 seems to explain the issue I’m seeing.
However, I’m running both Kafka and Kafka Streams on version 1.1.0, so I wonder why this
issue is still occurring.

-David

> On Jul 10, 2018, at 9:38 AM, Matthias J. Sax <matthias@confluent.io> wrote:
> 
> Could it be that you hit https://issues.apache.org/jira/browse/KAFKA-6634?
> 
> -Matthias
> 
> On 7/9/18 7:58 PM, David Chu wrote:
>> I have a Kafka Streams application which is currently failing to start due to the following ProducerFencedException:
>> 
>> "Caused by: org.apache.kafka.common.errors.ProducerFencedException: task [0_57] Abort sending since producer got fenced with a previous record (key ABCD value [B@4debf146 timestamp 1531159392586) to topic my-stream-1-store-changelog due to Producer attempted an operation with an old epoch. Either there is a newer producer with the same transactionalId, or the producer's transaction has been expired by the broker."
>> 
>> My Streams application has exactly-once processing enabled and also has a state store with logging enabled. The application had been running for some time but was recently shut down, and now when I try to start it back up it always fails with ProducerFencedExceptions like the one shown above. From what I can tell, these exceptions occur because the producer transactions are timing out, which makes the producer epoch for their transactionalId invalid. I believe the transactions are timing out because recovery of the state store takes longer than the default transaction timeout of 1 minute. My reasoning is based on the following sequence of events in the broker and application logs:
>> 
>> 1. The Kafka Streams application is started, and the following log appears in the Kafka broker, indicating that the producer transactions have been initialized:
>> 
>> "[2018-07-10T01:34:21,112Z]  [INFO ]  [kafka-request-handler-0]  [k.c.t.TransactionCoordinator]  [TransactionCoordinator id=79213818] Initialized transactionalId my-stream-1-0_37 with producerId 6011 and producer epoch 33 on partition __transaction_state-41"
>> 
>> 2. Back in the Kafka Streams application logs, I can see that the stream threads are still recovering their state stores from the changelog topic, as shown by log messages like the following:
>> 
>> "[2018-07-10T01:34:23,164Z]  [INFO ]  [my-stream-1-755e7bc7-831d-4d3f-8d4c-2d2641095afa-StreamThread-5]  [c.a.a.s.k.s.StateRestorationMonitor]  Starting restoration of topic [my-stream-1-store-changelog] partition [27] for state store [store] with starting offset [0] and ending offset [2834487]"
>> 
>> 3. Over a minute goes by while state store restoration is still in progress, and then the following log messages appear in the Kafka broker:
>> 
>> "[2018-07-10T01:36:29,542Z]  [INFO ]  [kafka-request-handler-4]  [k.c.t.TransactionCoordinator]  [TransactionCoordinator id=79213818] Completed rollback ongoing transaction of transactionalId: my-stream-1-0_37 due to timeout"
>> 
>> "[2018-07-10T01:36:48,387Z]  [ERROR]  [kafka-request-handler-5]  [kafka.server.ReplicaManager]  [ReplicaManager broker=79213818] Error processing append operation on partition my-stream-1-store-changelog-37
>> org.apache.kafka.common.errors.ProducerFencedException: Producer's epoch is no longer valid. There is probably another producer with a newer epoch. 33 (request epoch), 34 (server epoch)"
>> 
>> 4. Soon after that, the Kafka Streams application transitions into the ERROR state and does not recover.
>> 
>> So from what I can tell, the producer transactions are timing out because state store recovery takes over a minute to complete, and while recovery is in progress the stream threads do not commit their transactions. If this is the case, would it make sense to not begin the producer transactions until after state store recovery has completed? This would help prevent long state store recoveries from causing the transactions to time out.
>> 
>> Thanks,
>> David
> 
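
The timeout discussed in this thread is configurable, which suggests a possible stopgap while the underlying issue is investigated. Kafka Streams forwards settings prefixed with `producer.` to its internal producers, so `transaction.timeout.ms` can be raised above its 1 minute default in the hope that it outlasts state store restoration. The following is only a sketch under assumptions, not a confirmed fix for KAFKA-6634: the class and method names are hypothetical, the bootstrap address is a placeholder, the application id `my-stream-1` is inferred from the transactionalId in the broker logs, and the 10 minute value is arbitrary.

```java
import java.util.Properties;

// Hypothetical helper class; none of these names come from the original thread.
public class StreamsEosTimeoutSketch {

    // Builds Kafka Streams properties with exactly-once enabled and a larger
    // producer transaction timeout, using plain string config keys.
    static Properties buildConfig(long transactionTimeoutMs) {
        Properties props = new Properties();
        // Inferred from the transactionalId "my-stream-1-0_37" seen in the broker logs.
        props.put("application.id", "my-stream-1");
        // Placeholder address (assumption).
        props.put("bootstrap.servers", "localhost:9092");
        // Exactly-once processing, as described in the original report.
        props.put("processing.guarantee", "exactly_once");
        // The "producer." prefix routes this setting to the internal producers.
        // The producer default for transaction.timeout.ms is 60000 (1 minute).
        props.put("producer.transaction.timeout.ms", Long.toString(transactionTimeoutMs));
        return props;
    }

    public static void main(String[] args) {
        // 10 minutes: an arbitrary value chosen for illustration.
        Properties props = buildConfig(10 * 60 * 1000L);
        System.out.println(props.getProperty("producer.transaction.timeout.ms"));
    }
}
```

The resulting `Properties` object would be passed to the `KafkaStreams` constructor in place of the application's existing configuration. The requested timeout must not exceed the broker's `transaction.max.timeout.ms` (15 minutes by default), otherwise the broker rejects the producer's initialization request.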

