kafka-users mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Sachin Mittal <sjmit...@gmail.com>
Subject Re: Getting thread deadlock in streams 10.2.0 if a partition is re-assigned to a different thread in same instance
Date Thu, 02 Mar 2017 17:43:17 GMT
Hi,
It makes sense looks like task 0_4 took more than max poll timeout time.
However I have difficulty processing following lines:

DEBUG 2017-03-01 18:17:42,465 [StreamThread-1]:
org.apache.kafka.streams.processor.internals.StreamThread - stream-thread
[StreamThread-1] creating new task 0_4
...
DEBUG 2017-03-01 18:17:42,590 [StreamThread-1]:
org.apache.kafka.clients.consumer.internals.Fetcher - Handling
ListOffsetResponse response for new-part-advice-key-table-changelog-4.
Fetched offset 8591351, timestamp -1
DEBUG 2017-03-01 18:17:42,590 [StreamThread-1]:
org.apache.kafka.clients.consumer.KafkaConsumer - Seeking to beginning of
partition new-part-advice-key-table-changelog-4
DEBUG 2017-03-01 18:17:42,590 [StreamThread-1]:
org.apache.kafka.clients.consumer.internals.Fetcher - Resetting offset for
partition new-part-advice-key-table-changelog-4 to earliest offset.
DEBUG 2017-03-01 18:17:42,591 [StreamThread-1]:
org.apache.kafka.clients.consumer.internals.Fetcher - Handling
ListOffsetResponse response for new-part-advice-key-table-changelog-4.
Fetched offset 8408559, timestamp -1
DEBUG 2017-03-01 18:17:48,357 [StreamThread-1]:
org.apache.kafka.clients.consumer.internals.Fetcher - Ignoring fetched
records for new-part-advice-key-table-changelog-4 at offset 8408559 since
the current position is 8411374

...
DEBUG 2017-03-01 18:24:18,585 [StreamThread-1]:
org.apache.kafka.clients.consumer.internals.Fetcher - Ignoring fetched
records for new-part-advice-key-table-changelog-4 at offset 8590981 since
the current position is 8591351
DEBUG 2017-03-01 18:24:18,585 [StreamThread-1]:
org.apache.kafka.clients.consumer.internals.Fetcher - Sending fetch for
partitions [new-part-advice-key-table-changelog-4] to broker
192.168.73.199:9092 (id: 5 rack: null)


So when the new task is created it tries to replay the entire change log 4
partition from offset 8408559 to offset 8591351, and the whole process
takes about 7 minutes.

So why does it take so long to read some 180000 records? Is this some
rocksdb issue because it tries to recreate the entire state store.

Also why did it replay from the beginning? Shouldn't it have started from
somewhere in between, place where the previous thread to which this
partition was assigned, committed its offsets?

Thanks
Sachin



On Thu, Mar 2, 2017 at 9:05 PM, Damian Guy <damian.guy@gmail.com> wrote:

> Hi Sachin,
>
> This is because another thread hasn't released the lock. It is generally
> caused because it is taking a long time to restore or process the incoming
> records. It might be helpful to take a Thread dump so we can see what each
> thread is doing.
>
> Thanks,
> Damian
>
> On Thu, 2 Mar 2017 at 12:14 Sachin Mittal <sjmittal@gmail.com> wrote:
>
> > Hi,
> > We are getting some deadlock state after re-balance and what we found
> that
> > it happens after re-balance the partition is assigned to a different
> thread
> > in same instance.
> >
> > Looks like previous thread is not releasing the rocks db lock.
> >
> > Here are the complete logs spanned across multiple files.
> >
> >
> > https://drive.google.com/open?id=0B-KsMy4P1bGxLTVVVWlZOGdQZTUzYm9a
> V3lfOFdubTJ2X2JR
> >
> > I am just posting the gist here as how we understood this issue.
> >
> >
> > *===== TestKafkaAdvice.2017-03-01.89.log ========*
> >
> > 53352 INFO 2017-03-01 18:11:56,980 [StreamThread-3]:
> > org.apache.kafka.clients.consumer.internals.AbstractCoordinator -
> > Successfully joined group new-part-advice with *generation 96*
> >
> > 53353 INFO 2017-03-01 18:11:56,980 [StreamThread-4]:
> > org.apache.kafka.clients.consumer.internals.AbstractCoordinator -
> > Successfully joined group new-part-advice with *generation 96*
> >
> > 53354 INFO 2017-03-01 18:11:56,980 [StreamThread-1]:
> > org.apache.kafka.clients.consumer.internals.AbstractCoordinator -
> > Successfully joined group new-part-advice with *generation 96*
> >
> >
> > 53361 INFO 2017-03-01 *18:11:57,084* [StreamThread-1]:
> > org.apache.kafka.streams.processor.internals.StreamThread -
> > stream-thread *[StreamThread-1]
> > New partitions [[advice-stream-9, advice-strea**m-6]]* assigned at the
> end
> > of consumer rebalance.
> >
> > 53362 INFO 2017-03-01 18:11:57,084 [StreamThread-3]:
> > org.apache.kafka.streams.processor.internals.StreamThread -
> > stream-thread *[StreamThread-3]
> > New partitions [[advice-stream-7]]* assigned at the end of consumer
> > rebalance.
> >
> > 53363 DEBUG 2017-03-01 18:11:57,084 [StreamThread-3]:
> > org.apache.kafka.streams.processor.internals.StreamThread -
> stream-thread
> > [StreamThread-3] closing suspended non-assigned task
> >
> > 53364 INFO 2017-03-01 18:11:57,084 [StreamThread-4]:
> > org.apache.kafka.clients.consumer.internals.ConsumerCoordinator -
> Setting
> > newly assigned partitions [advice-stream-8, advice-stream-2] fo      r
> > group new-part-advice
> >
> > 53365 INFO 2017-03-01 18:11:57,084 [StreamThread-4]:
> > org.apache.kafka.streams.processor.internals.StreamThread -
> > stream-thread *[StreamThread-4]
> > New partitions [[advice-stream-8, advice-strea**m-2]]* assigned at the
> end
> > of consumer rebalance.
> >
> >
> >
> > 54253 INFO 2017-03-01 18:13:19,146 *[StreamThread-4]*:
> > org.apache.kafka.streams.processor.internals.StreamThread -
> > stream-thread *[StreamThread-4]
> > Creating active task 0_2 with assigned partiti**ons [[advice-stream-2]]*
> >
> >
> > 56789 INFO 2017-03-01 18:17:42,311 [StreamThread-1]:
> > org.apache.kafka.clients.consumer.internals.AbstractCoordinator -
> > Successfully joined group new-part-advice with *generation 97*
> >
> > 56790 INFO 2017-03-01 18:17:42,311 [StreamThread-3]:
> > org.apache.kafka.clients.consumer.internals.AbstractCoordinator -
> > Successfully joined group new-part-advice with *generation 97*
> >
> >
> > 56796 INFO 2017-03-01 *18:17:42,312* [StreamThread-3]:
> > org.apache.kafka.streams.processor.internals.StreamThread -
> > stream-thread *[StreamThread-3]
> > New partitions [[advice-stream-9, advice-strea**m-6]]* assigned at the
> end
> > of consumer rebalance.
> >
> > 56805 INFO 2017-03-01 18:17:42,312 [StreamThread-1]:
> > org.apache.kafka.streams.processor.internals.StreamThread -
> > stream-thread *[StreamThread-1]
> > New partitions [[advice-stream-8, advice-strea**m-4]]* assigned at the
> end
> > of consumer rebalance.
> >
> >
> >
> > 56823 INFO 2017-03-01 18:17:42,465 *[StreamThread-1]*:
> > org.apache.kafka.streams.processor.internals.StreamThread -
> > stream-thread *[StreamThread-1]
> > Creating active task 0_4 with assigned partiti**ons [[advice-stream-4]]*
> >
> > 56831 DEBUG 2017-03-01 18:17:42,531 *[StreamThread-1]*:
> > org.apache.kafka.clients.consumer.internals.ConsumerCoordinator - Group
> > new-part-advice *fetching committed offsets for partitions:
> > [advi**ce-stream-8,
> > advice-stream-4]*
> >
> >
> >
> > 57769 INFO 2017-03-01 *18:20:16,291* [StreamThread-4]:
> > org.apache.kafka.streams.processor.internals.StreamThread -
> > stream-thread *[StreamThread-4]
> > Creating active task 0_8 with assigned partiti**ons [[advice-stream-8]]*
> >
> >
> >
> >
> > 57935 INFO 2017-03-01 18:20:19,181 [StreamThread-4]:
> > org.apache.kafka.streams.processor.internals.StreamThread -
> > stream-thread *[StreamThread-4]
> > Removing all active tasks [[0_2, 0_8]]*
> >
> >
> >
> > *======== TestKafkaAdvice.2017-03-01.90.log  ============*
> >
> >
> >  742 INFO 2017-03-01 *18:22:42,697* [StreamThread-3]:
> > org.apache.kafka.clients.consumer.internals.AbstractCoordinator -
> > Successfully joined group new-part-advice with *generation 98*
> >
> >  743 INFO 2017-03-01 18:22:42,697 [StreamThread-4]:
> > org.apache.kafka.clients.consumer.internals.AbstractCoordinator -
> > Successfully joined group new-part-advice with *generation 98*
> >
> >
> > 746 INFO 2017-03-01 18:22:42,697 [StreamThread-3]:
> > org.apache.kafka.streams.
> > processor.internals.StreamThread - stream-thread *[StreamThread-3] New
> > partitions [[advice-stream-9, advice-stream**-4]]* assigned at the end of
> > consumer rebalance.
> >
> > 748 INFO 2017-03-01 18:22:42,712 [StreamThread-4]:
> > org.apache.kafka.streams.
> > processor.internals.StreamThread - stream-thread *[StreamThread-4] New
> > partitions [[advice-stream-8, advice-stream**-2]]* assigned at the end of
> > consumer rebalance.
> >
> > ...............
> >
> > 4502 WARN 2017-03-01 *18:26:23,323* *[StreamThread-3]*:
> > org.apache.kafka.streams.processor.internals.StreamThread - *Could not
> > create task 0_4*. Will retry.
> >
> >
> > ***************************************  Some time later.......*
> >
> >
> > 4444 INFO 2017-03-01 18:25:44,071 [StreamThread-4]:
> > org.apache.kafka.streams.processor.internals.StreamThread -
> > stream-thread *[StreamThread-4]
> > Removing all active tasks [[0_2, 0_8]]*
> >
> >
> > ************************************
> > *Task 0_2 (from line # 54253) moved to a different instance's thread-3
> and
> > Task 0_8 got re-assigned to Thread-4 (line # 4584 below)*
> >
> >
> > 4474 INFO 2017-03-01 18:26:13,443 [StreamThread-1]:
> > org.apache.kafka.streams.processor.internals.StreamThread -
> > stream-thread [*StreamThread-1]
> > Creating active task 0_8 with assigned partitio**ns [[advice-stream-8]]*
> >
> > 4475 WARN 2017-03-01 18:26:14,445 *[StreamThread-1]*:
> > org.apache.kafka.streams.processor.internals.StreamThread - *Could not
> > create task 0_8*. Will retry.
> >
> >
> >
> >
> > 4581 INFO 2017-03-01 *18:27:43,048* [StreamThread-4]:
> > org.apache.kafka.clients.consumer.internals.AbstractCoordinator -
> > Successfully joined group new-part-advice with generation 99
> >
> >
> > 4584 INFO 2017-03-01 18:27:43,048 [StreamThread-4]:
> > org.apache.kafka.streams.processor.internals.StreamThread -
> > stream-thread *[StreamThread-4]
> > New partitions [[advice-stream-8, advice-stream**-4]]* assigned at the
> end
> > of consumer rebalance.
> >
> >
> > *************************************  Task Task 0_4 lock's as per line
> > #56823 above still seems to be with Thread-1 !!!*
> >
> >
> > 4625 WARN 2017-03-01 *18:27:44,092* *[StreamThread-4]*:
> > org.apache.kafka.streams.processor.internals.StreamThread - *Could not
> > create task 0_4.* Will retry.
> >
> > 4646 WARN 2017-03-01 18:27:45,145 *[StreamThread-4]*:
> > org.apache.kafka.streams.processor.internals.StreamThread - Could not
> > create task 0_4. Will retry.
> >
> > 4955 WARN 2017-03-01 18:29:37,508 *[StreamThread-4]*:
> > org.apache.kafka.streams.processor.internals.StreamThread - Could not
> > create task 0_4. Will retry.
> >
> > 4982 WARN 2017-03-01 18:29:49,125 *[StreamThread-3]*:
> > org.apache.kafka.streams.processor.internals.StreamThread - Could not
> > create task 0_4. Will retry.
> >
> > 5073 WARN 2017-03-01 18:31:20,911 *[StreamThread-4]*:
> > org.apache.kafka.streams.processor.internals.StreamThread - Could not
> > create task 0_4. Will retry.
> >
> > 5173 WARN 2017-03-01 18:34:46,714 *[StreamThread-4]*:
> > org.apache.kafka.streams.processor.internals.StreamThread - Could not
> > create task 0_4. Will retry.
> >
> > 5199 WARN 2017-03-01 18:36:39,729 *[StreamThread-3]*:
> > org.apache.kafka.streams.processor.internals.StreamThread - Could not
> > create task 0_4. Will retry.
> >
> > 5248 WARN 2017-03-01 18:41:37,329 *[StreamThread-4]*:
> > org.apache.kafka.streams.processor.internals.StreamThread - Could not
> > create task 0_4. Will retry.
> >
> > 5279 WARN 2017-03-01 18:50:19,977 *[StreamThread-3]*:
> > org.apache.kafka.streams.processor.internals.StreamThread - Could not
> > create task 0_4. Will retry.
> >
> >
> > 5397 WARN 2017-03-01 *19:19:13,900* *[StreamThread-1]*:
> > org.apache.kafka.streams.processor.internals.StreamThread - *Could not
> > create task 0_8*. Will retry.
> >
> > 5721 WARN 2017-03-01 *22:03:07,629 [StreamThread-1]*:
> > org.apache.kafka.streams.processor.internals.StreamThread - *Could not
> > create task 0_8*. Will retry.
> >
> >
> > 5698 WARN 2017-03-01 *22:01:32,017* *[StreamThread-3]*:
> > org.apache.kafka.streams.processor.internals.StreamThread - Could not
> > create task 0_4. Will retry.
> >
> > 5746 WARN 2017-03-01 *22:06:29,340* *[StreamThread-4]*:
> > org.apache.kafka.streams.processor.internals.StreamThread - *Could not
> > create task 0_4.* Will retry.
> >
> >
> > *======== TestKafkaAdvice.log  ============*
> >
> >
> > 148 WARN 2017-03-02 *01:41:35,831* *[StreamThread-1]*:
> > org.apache.kafka.streams.processor.internals.StreamThread - *Could not
> > create task 0_8.* Will retry.
> >
> > 735 WARN 2017-03-02 *08:58:31,234* *[StreamThread-1]*:
> > org.apache.kafka.streams.processor.internals.StreamThread - *Could not
> > create task 0_8*. Will retry.
> >
> >
> > 712 WARN 2017-03-02 *08:56:55,810 [StreamThread-3]*:
> > org.apache.kafka.streams.processor.internals.StreamThread - Could not
> > create task 0_4. Will retry.
> >
> > 760 WARN 2017-03-02 *09:01:52,945* *[StreamThread-4]*:
> > org.apache.kafka.streams.processor.internals.StreamThread - *Could not
> > create task 0_4*. Will retry.
> >
> >
> > ***********************************    Finally all threads reach the
> > deadlock state where they cannot create task*
> >
> >
> > Please let us know, why we get such a deadlock state.
> >
> >
> > Thanks
> >
> > Sachin
> >
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message