kafka-users mailing list archives

From Avanish Mishra <avanish...@yahoo.com.INVALID>
Subject Not able to read committed offset on nodes failures with replication setup.
Date Thu, 25 Feb 2016 06:38:03 GMT
We are running a 10-node Kafka cluster in a test setup with a replication factor of 3 and topics
with min.insync.replicas set to 2.
Recently I noticed that a few nodes halted on restart after a multiple-node failure, with this FATAL
message:

"Halting because log truncation is not allowed for topic 1613_spam, Current leader 2003's
latest offset 20 is less than replica 2004's latest offset 21 (kafka.server.ReplicaFetcherThread)"
My understanding is that this can happen if there is a slow replica in the ISR that does not have
the latest committed message and high watermark. Since min.insync.replicas is 2, a write is committed
once it completes on the leader and one follower. Since replica.lag.time.max.ms is 10000,
a slow replica can stay in the ISR for up to 10 seconds without fetching any messages. If the leader goes
down within that interval and the slow follower is elected leader, the new leader ends up with
a lower offset than the follower. Is this explanation correct, or am I missing something?
What is the best way to recover the committed message in such a situation?
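
To make the scenario concrete, here is a minimal Python sketch of the sequence described above. It is a toy model of the hypothesis only, not Kafka's actual replication code: the Replica class, the in_isr and follower_action helpers, broker ID 2002 for the original leader, and the 4-second lag are all assumptions made up for illustration. Only the offsets 20 and 21, broker IDs 2003 and 2004, and the 10000 ms limit come from the log message and settings in this post.

```python
from dataclasses import dataclass

REPLICA_LAG_TIME_MAX_MS = 10_000  # replica.lag.time.max.ms from the post


@dataclass
class Replica:
    broker_id: int
    log_end_offset: int
    ms_since_caught_up: int = 0  # assumed: time since it last caught up


def in_isr(replica: Replica) -> bool:
    """Toy ISR rule: a replica counts as in sync until it lags past the limit."""
    return replica.ms_since_caught_up <= REPLICA_LAG_TIME_MAX_MS


def follower_action(leader: Replica, follower: Replica, unclean_election: bool) -> str:
    """What a restarting follower does after comparing its log with the leader's."""
    if follower.log_end_offset <= leader.log_end_offset:
        return "fetch"  # normal case: follower is behind or equal
    # Follower is ahead of the leader: it may only truncate if unclean
    # leader election is enabled, otherwise it halts.
    return "truncate" if unclean_election else "halt"


# Hypothesised state just before the failure.
old_leader = Replica(2002, log_end_offset=21)  # hypothetical broker ID
fast = Replica(2004, log_end_offset=21)
slow = Replica(2003, log_end_offset=20, ms_since_caught_up=4_000)

# The slow replica has lagged for less than 10 s, so it is still in the ISR
# and is eligible to become leader when the old leader fails.
slow_still_in_isr = in_isr(slow)
new_leader = slow

# Broker 2004 restarts and finds itself ahead of the new leader.
outcome = follower_action(new_leader, fast, unclean_election=False)
print(slow_still_in_isr, outcome)
```

With unclean.leader.election.enable set to false, the sketch ends in the "halt" branch, which corresponds to the FATAL message quoted earlier. Whether a lagging replica can really be elected while a committed message is missing from it is exactly the part I would like confirmed.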
 
We are running the cluster with the following settings:
- replication factor: 3
- min.insync.replicas: 2
- request.required.acks: -1
- unclean.leader.election.enable: false
- replica.lag.time.max.ms: 10000
- replica.high.watermark.checkpoint.interval.ms: 1000


Thanks 
Avanish