kafka-users mailing list archives

From Sachin Mittal <sjmit...@gmail.com>
Subject Fixing two critical bugs in kafka streams
Date Sun, 05 Mar 2017 07:34:31 GMT
So far in our experiment we have encountered 2 critical bugs.
1. If a thread takes more than MAX_POLL_INTERVAL_MS_CONFIG to compute a
cycle it gets evicted from the group, a rebalance takes place and it gets new
partitions.
However, when this thread tries to commit offsets for the revoked partitions
in onPartitionsRevoked it will again throw the CommitFailedException.

This gets handled by ConsumerCoordinator, so there is no point in assigning
this exception to rebalanceException in StreamThread and stopping the
thread. It has already been assigned new partitions and it can continue.

So as a fix, in case of CommitFailedException I am not killing the
StreamThread.
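
A minimal sketch of the idea in point 1, not the actual patch: the commit made
while partitions are being revoked is wrapped so that a CommitFailedException
is swallowed and the thread keeps running, while any other exception still
propagates. The class CommitOnRevoke, the method commitOnRevoke and the nested
CommitFailedException are hypothetical stand-ins so the example compiles
without the Kafka client on the classpath; the real exception is
org.apache.kafka.clients.consumer.CommitFailedException.

```java
public class CommitOnRevoke {

    // Hypothetical stand-in for
    // org.apache.kafka.clients.consumer.CommitFailedException,
    // declared here only to keep the sketch self-contained.
    public static class CommitFailedException extends RuntimeException {
        public CommitFailedException(String message) {
            super(message);
        }
    }

    /**
     * Runs the given commit action (assumed to commit offsets for the
     * revoked partitions). Returns true if the commit succeeded and false
     * if it failed with CommitFailedException. Other exceptions are not
     * caught and still reach the caller.
     */
    public static boolean commitOnRevoke(Runnable commit) {
        try {
            commit.run();
            return true;
        } catch (CommitFailedException e) {
            // The thread was already evicted and has been given new
            // partitions, so this failure is not recorded as a fatal
            // rebalance error and the thread is not stopped.
            return false;
        }
    }
}
```

The boolean return value is only there to make the outcome observable; a real
fix would more likely log the exception and carry on.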

2. Next we see a deadlock state when processing a task takes a long time.
This thread's partitions are then assigned to some other thread, including
the rocksdb lock. When it tries to process the next task it cannot get the
rocksdb lock and simply keeps waiting for that lock forever.

In retryWithBackoff for AbstractTaskCreator we have backoffTimeMs = 50L.
If it does not get the lock then we simply increase the time by 10x and keep
trying inside the while true loop.

We need to have an upper bound for this backoffTimeMs. If the time is greater
than MAX_POLL_INTERVAL_MS_CONFIG and it still hasn't got the lock, it means
this thread's partitions have moved somewhere else and it may not get the
lock again.

So I have added an upper bound check in that while loop.
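
A minimal sketch of such a bounded retry loop, assuming the lock attempt can be
modelled as a BooleanSupplier. The class BoundedBackoff, the method
retryWithBoundedBackoff and its parameters are hypothetical names, not the
code in AbstractTaskCreator; maxBackoffMs plays the role of the value
configured for MAX_POLL_INTERVAL_MS_CONFIG.

```java
import java.util.function.BooleanSupplier;

public class BoundedBackoff {

    /**
     * Calls tryLock until it succeeds, sleeping between attempts and
     * multiplying the backoff by 10 after each failure. Gives up and
     * returns false once the backoff has grown past maxBackoffMs, or if
     * the thread is interrupted while sleeping.
     */
    public static boolean retryWithBoundedBackoff(BooleanSupplier tryLock,
                                                  long initialBackoffMs,
                                                  long maxBackoffMs) {
        long backoffTimeMs = initialBackoffMs;
        while (true) {
            if (tryLock.getAsBoolean()) {
                return true; // lock acquired
            }
            if (backoffTimeMs > maxBackoffMs) {
                // Upper bound reached: the partitions have most likely
                // moved to another thread, so stop waiting for the lock.
                return false;
            }
            try {
                Thread.sleep(backoffTimeMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // keep the interrupt flag
                return false;
            }
            backoffTimeMs *= 10; // same 10x growth as described above
        }
    }
}
```

With initialBackoffMs = 50 the waits are 50 ms, 500 ms, 5000 ms and so on, so
the loop ends after a handful of attempts instead of spinning forever.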

The commits are here:

Please review, and if you feel they make sense, please merge them to main
