kafka-users mailing list archives

From Eno Thereska <eno.there...@gmail.com>
Subject Re: Fixing two critical bugs in kafka streams
Date Sun, 05 Mar 2017 08:19:24 GMT
Thanks Sachin for your contribution. Could you create a pull request out of the commit (so
we can add comments, and also so you are acknowledged properly for your contribution)? 

> On 5 Mar 2017, at 07:34, Sachin Mittal <sjmittal@gmail.com> wrote:
> Hi,
> So far in our experiment we have encountered 2 critical bugs.
> 1. If a thread takes more than MAX_POLL_INTERVAL_MS_CONFIG to compute a
> cycle, it gets evicted from the group, a rebalance takes place, and it gets
> a new assignment.
> However, when this thread tries to commit offsets for the revoked partitions
> in onPartitionsRevoked, it will again throw the CommitFailedException.
> This gets handled by ConsumerCoordinator, so there is no point in assigning
> this exception to rebalanceException in StreamThread and stopping it. It has
> already been assigned new partitions and it can continue.
> So as a fix, in the case of CommitFailedException I am not killing the
> StreamThread.
> 2. Next, we see a deadlock state when a task takes a long time to process.
> This thread's partitions are then assigned to some other thread, including
> the RocksDB lock. When it tries to process the next task it cannot get the
> RocksDB lock and simply keeps waiting for that lock forever.
> In retryWithBackoff for AbstractTaskCreator we have backoffTimeMs = 50L.
> If it does not get the lock, we simply increase the time by 10x and keep
> trying inside the while (true) loop.
> We need to have an upper bound for this backoffTimeMs. If the time is greater
> than MAX_POLL_INTERVAL_MS_CONFIG and it still hasn't got the lock, it means
> this thread's partitions have been moved somewhere else and it may not get
> the lock again.
> So I have added an upper bound check in that while loop.
> The commits are here:
> https://github.com/sjmittal/kafka/commit/6f04327c890c58cab9b1ae108af4ce5c4e3b89a1
> Please review, and if you feel they make sense, please merge them to the
> main branch.
> Thanks
> Sachin
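
For readers following the second point, the bounded retry described above could look roughly like the sketch below. It is not the code from the linked commit and not the actual Kafka Streams implementation: the class name BoundedBackoff, the method signature, the injected sleeper, and the 300000 ms limit are assumptions made for illustration. Only the 50 ms starting backoff, the 10x growth, and the idea of giving up once the backoff exceeds the poll interval come from the message.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BooleanSupplier;
import java.util.function.LongConsumer;

// Hypothetical stand-in for the retry loop discussed in the message.
public class BoundedBackoff {

    /**
     * Calls tryLock until it succeeds, waiting between attempts with a
     * backoff that grows 10x after each failure. Gives up and returns false
     * once the next wait would exceed maxBackoffMs. The sleeper is injected
     * so the example can record waits instead of really sleeping.
     */
    static boolean retryWithBackoff(BooleanSupplier tryLock,
                                    long initialBackoffMs,
                                    long maxBackoffMs,
                                    LongConsumer sleeper) {
        long backoffMs = initialBackoffMs;
        while (true) {
            if (tryLock.getAsBoolean()) {
                return true;              // lock acquired
            }
            if (backoffMs > maxBackoffMs) {
                return false;             // upper bound reached: stop waiting
            }
            sleeper.accept(backoffMs);    // wait before the next attempt
            backoffMs *= 10;              // 10x growth, as in the message
        }
    }

    public static void main(String[] args) {
        List<Long> waits = new ArrayList<>();
        // A lock that is never released to us, e.g. the task moved elsewhere.
        boolean acquired = retryWithBackoff(
                () -> false,
                50L,          // starting backoff from the message
                300_000L,     // illustrative stand-in for the poll interval
                ms -> waits.add(ms));
        System.out.println("acquired=" + acquired + " waits=" + waits);
    }
}
```

With these illustrative values the loop stops after four waits instead of spinning forever, and the caller can then decide how to treat the task it could not lock.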
