samza-dev mailing list archives

From 李斯宁 <lisin...@gmail.com>
Subject Re: Samza container hang on exception
Date Fri, 02 Sep 2016 08:03:43 GMT
Can anyone help with this? Thanks!
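
The quoted messages below mention that SAMZA-911 is meant to stop the process after too many failed attempts instead of retrying forever. A minimal sketch of that general idea (bounded retries with exponential backoff, then fail fast) in plain Java follows. `BoundedRetry`, `RetriesExhaustedException`, and all parameter names are hypothetical and chosen for illustration only; this is not Samza's or Kafka's API and not the actual SAMZA-911 change.

```java
import java.util.concurrent.Callable;

public final class BoundedRetry {
    private BoundedRetry() {}

    /** Hypothetical exception: thrown once every attempt has failed, so the caller can shut down instead of hanging. */
    public static final class RetriesExhaustedException extends RuntimeException {
        public RetriesExhaustedException(int attempts, Throwable lastFailure) {
            super("Giving up after " + attempts + " attempts", lastFailure);
        }
    }

    /**
     * Runs the action, retrying on any exception up to maxAttempts times.
     * The sleep between attempts doubles each time, capped at maxDelayMs.
     */
    public static <T> T run(Callable<T> action, int maxAttempts,
                            long initialDelayMs, long maxDelayMs) throws InterruptedException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long delayMs = initialDelayMs;
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.call();
            } catch (InterruptedException e) {
                // Do not swallow interrupts; let the caller stop promptly.
                throw e;
            } catch (Exception e) {
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(delayMs);
                    delayMs = Math.min(maxDelayMs, delayMs * 2);
                }
            }
        }
        // Fail fast: surface the last error instead of looping indefinitely.
        throw new RetriesExhaustedException(maxAttempts, last);
    }
}
```

A caller could wrap a send in `BoundedRetry.run(...)` and let the resulting exception terminate the process, so that a supervisor (YARN, in a typical Samza deployment) can restart it rather than leaving it stuck in a sleep loop.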

On Thu, Sep 1, 2016 at 11:59 AM, 李斯宁 <lisining@gmail.com> wrote:

> If you cannot see the attachment, please try
> http://note.youdao.com/noteshare?id=56b826c24af47a9fdb600490ce788710
>
> On Thu, Sep 1, 2016 at 1:50 AM, Chinmay Soman <chinmay.cerebro@gmail.com>
> wrote:
>
>> Sorry, I don't see anything in the attachment. Can you please re-attach and
>> re-send?
>>
>> On Wed, Aug 31, 2016 at 3:27 AM, 李斯宁 <lisining@gmail.com> wrote:
>>
>> > It seems upgrading does not solve the problem. All tasks hung during today's
>> > "rush hour".
>> > I have attached the log and the jstack output.
>> >
>> > SAMZA-911 is meant to fix this by stopping the process after too many
>> > failures, but the process is still there and hanging.
>> >
>> > On Mon, Aug 22, 2016 at 1:14 PM, 李斯宁 <lisining@gmail.com> wrote:
>> >
>> >> Thanks so much, I'll try.
>> >>
>> >> On Mon, Aug 22, 2016 at 6:26 AM, Yi Pan <nickpan47@gmail.com> wrote:
>> >>
>> >>> Hi, Sining,
>> >>>
>> >>> This is a known bug that is fixed in 0.10.1 (SAMZA-911). Please try to
>> >>> upgrade to 0.10.1.
>> >>>
>> >>> Thanks!
>> >>>
>> >>> -Yi
>> >>>
>> >>> On Sun, Aug 21, 2016 at 5:55 AM, 李斯宁 <lisining@gmail.com> wrote:
>> >>>
>> >>> > I have tried restarting every Kafka server. The container did not recover.
>> >>> >
>> >>> > The log contains the following:
>> >>> >
>> >>> > 2016-08-21 20:08:21 [WARN ](o.a.s.s.k.KafkaSystemProducer :66 ) Retrying send messsage due to RetriableException - org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition.. Turn on debugging to get a full stack trace
>> >>> > 2016-08-21 20:08:22 [WARN ](o.a.k.c.p.i.Sender :257) Got error produce response with correlation id 4364 on topic-partition samzaMetrics-5, retrying (0 attempts left). Error: NOT_LEADER_FOR_PARTITION
>> >>> > 2016-08-21 20:08:23 [WARN ](o.a.k.c.p.i.Sender :257) Got error produce response with correlation id 4367 on topic-partition samzaMetrics-5, retrying (29 attempts left). Error: NOT_LEADER_FOR_PARTITION
>> >>> >
>> >>> >
>> >>> > jstack shows:
>> >>> >
>> >>> > "main" #1 prio=5 os_prio=0 tid=0x00007f1ba401a000 nid=0x1a621 waiting on condition [0x00007f1bab976000]
>> >>> >    java.lang.Thread.State: TIMED_WAITING (sleeping)
>> >>> >     at java.lang.Thread.sleep(Native Method)
>> >>> >     at org.apache.samza.util.ExponentialSleepStrategy$RetryLoopState.sleep(ExponentialSleepStrategy.scala:105)
>> >>> >     at org.apache.samza.util.ExponentialSleepStrategy.run(ExponentialSleepStrategy.scala:91)
>> >>> >     at org.apache.samza.system.kafka.KafkaSystemProducer.send(KafkaSystemProducer.scala:91)
>> >>> >     at org.apache.samza.system.SystemProducers.send(SystemProducers.scala:87)
>> >>> >     at org.apache.samza.task.TaskInstanceCollector.send(TaskInstanceCollector.scala:61)
>> >>> >     at toolbox.analyzer2.realtime.CommonWriter.write(CommonWriter.java:50)
>> >>> >     at toolbox.analyzer2.realtime.InitTask.lambda$process$0(InitTask.java:110)
>> >>> >     at toolbox.analyzer2.realtime.InitTask$$Lambda$4/938405008.emit(Unknown Source)
>> >>> >     at toolbox.analyzer2.util.core.TransToKvProcessor.process(TransToKvProcessor.java:146)
>> >>> >     at toolbox.analyzer2.realtime.InitTask$2.emit(InitTask.java:119)
>> >>> >     at toolbox.analyzer2.util.core.JsonExpander.expand(JsonExpander.java:47)
>> >>> >     at toolbox.analyzer2.realtime.InitTask.process(InitTask.java:128)
>> >>> >     at org.apache.samza.container.TaskInstance$$anonfun$process$1.apply$mcV$sp(TaskInstance.scala:150)
>> >>> >     at org.apache.samza.container.TaskInstanceExceptionHandler.maybeHandle(TaskInstanceExceptionHandler.scala:54)
>> >>> >     at org.apache.samza.container.TaskInstance.process(TaskInstance.scala:149)
>> >>> >     at org.apache.samza.container.RunLoop$$anonfun$process$1$$anonfun$apply$mcVJ$sp$2.apply(RunLoop.scala:122)
>> >>> >     at org.apache.samza.container.RunLoop$$anonfun$process$1$$anonfun$apply$mcVJ$sp$2.apply(RunLoop.scala:119)
>> >>> >     at scala.collection.immutable.List.foreach(List.scala:318)
>> >>> >     at org.apache.samza.container.RunLoop$$anonfun$process$1.apply$mcVJ$sp(RunLoop.scala:118)
>> >>> >     at org.apache.samza.util.TimerUtils$class.updateTimerAndGetDuration(TimerUtils.scala:51)
>> >>> >     at org.apache.samza.container.RunLoop.updateTimerAndGetDuration(RunLoop.scala:35)
>> >>> >     at org.apache.samza.container.RunLoop.process(RunLoop.scala:106)
>> >>> >     at org.apache.samza.container.RunLoop.run(RunLoop.scala:74)
>> >>> >     at org.apache.samza.container.SamzaContainer.run(SamzaContainer.scala:553)
>> >>> >     at org.apache.samza.container.SamzaContainer$.safeMain(SamzaContainer.scala:92)
>> >>> >     at org.apache.samza.container.SamzaContainer$.main(SamzaContainer.scala:66)
>> >>> >     at org.apache.samza.container.SamzaContainer.main(SamzaContainer.scala)
>> >>> >
>> >>> > Maybe the partition leader changed during rush hour, and the metrics-writing
>> >>> > method does not recognize that and retries again and again?
>> >>> >
>> >>> > Any response is appreciated :)
>> >>> >
>> >>> > On Sun, Aug 21, 2016 at 8:00 PM, 李斯宁 <lisining@gmail.com> wrote:
>> >>> >
>> >>> > > At the end of the container's log, these lines are printed:
>> >>> > >
>> >>> > > 2016-08-21 19:57:01 [WARN ](o.a.s.s.k.KafkaSystemProducer :66 ) Retrying send messsage due to RetriableException - org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition.. Turn on debugging to get a full stack trace
>> >>> > > 2016-08-21 19:57:11 [WARN ](o.a.s.s.k.KafkaSystemProducer :66 ) Retrying send messsage due to RetriableException - org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition.. Turn on debugging to get a full stack trace
>> >>> > > 2016-08-21 19:57:21 [WARN ](o.a.s.s.k.KafkaSystemProducer :66 ) Retrying send messsage due to RetriableException - org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition.. Turn on debugging to get a full stack trace
>> >>> > > 2016-08-21 19:57:31 [WARN ](o.a.s.s.k.KafkaSystemProducer :66 ) Retrying send messsage due to RetriableException - org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition.. Turn on debugging to get a full stack trace
>> >>> > > 2016-08-21 19:57:41 [WARN ](o.a.s.s.k.KafkaSystemProducer :66 ) Retrying send messsage due to RetriableException - org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition.. Turn on debugging to get a full stack trace
>> >>> > > 2016-08-21 19:57:51 [WARN ](o.a.s.s.k.KafkaSystemProducer :66 ) Retrying send messsage due to RetriableException - org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition.. Turn on debugging to get a full stack trace
>> >>> > > 2016-08-21 19:58:01 [WARN ](o.a.s.s.k.KafkaSystemProducer :66 ) Retrying send messsage due to RetriableException - org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition.. Turn on debugging to get a full stack trace
>> >>> > > 2016-08-21 19:58:11 [WARN ](o.a.s.s.k.KafkaSystemProducer :66 ) Retrying send messsage due to RetriableException - org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition.. Turn on debugging to get a full stack trace
>> >>> > > 2016-08-21 19:58:21 [WARN ](o.a.s.s.k.KafkaSystemProducer :66 ) Retrying send messsage due to RetriableException - org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition.. Turn on debugging to get a full stack trace
>> >>> > > 2016-08-21 19:58:31 [WARN ](o.a.s.s.k.KafkaSystemProducer :66 ) Retrying send messsage due to RetriableException - org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition.. Turn on debugging to get a full stack trace
>> >>> > > 2016-08-21 19:58:41 [WARN ](o.a.s.s.k.KafkaSystemProducer :66 ) Retrying send messsage due to RetriableException - org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition.. Turn on debugging to get a full stack trace
>> >>> > > 2016-08-21 19:58:51 [WARN ](o.a.s.s.k.KafkaSystemProducer :66 ) Retrying send messsage due to RetriableException - org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition.. Turn on debugging to get a full stack trace
>> >>> > > 2016-08-21 19:59:01 [WARN ](o.a.s.s.k.KafkaSystemProducer :66 ) Retrying send messsage due to RetriableException - org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition.. Turn on debugging to get a full stack trace
>> >>> > >
>> >>> > >
>> >>> > > On Sun, Aug 21, 2016 at 7:38 PM, 李斯宁 <lisining@gmail.com> wrote:
>> >>> > >
>> >>> > >> Hi guys,
>> >>> > >> I'm using Samza for realtime processing. After running for about 10 hours,
>> >>> > >> some containers paused and stopped processing.
>> >>> > >>
>> >>> > >> When I looked into the log, I found many entries like these:
>> >>> > >>
>> >>> > >> 2016-08-21 10:03:07 [WARN ](o.a.k.c.p.i.Sender :257) Got error produce response with correlation id 490345 on topic-partition test3_a2_mobileDictClient_android_uid_imei-3, retrying (17 attempts left). Error: NOT_LEADER_FOR_PARTITION
>> >>> > >> 2016-08-21 10:03:07 [WARN ](o.a.k.c.p.i.Sender :257) Got error produce response with correlation id 490345 on topic-partition test3_a2_mobileDictClient_android_uid_imei-4, retrying (18 attempts left). Error: NOT_LEADER_FOR_PARTITION
>> >>> > >> 2016-08-21 10:03:07 [WARN ](o.a.k.c.p.i.Sender :257) Got error produce response with correlation id 490345 on topic-partition test3_a2_mobileDictClient_android_uid_imei-6, retrying (18 attempts left). Error: NOT_LEADER_FOR_PARTITION
>> >>> > >> 2016-08-21 10:03:07 [WARN ](o.a.k.c.p.i.Sender :257) Got error produce response with correlation id 490346 on topic-partition test3_a2_mobileDictClient_android_uid_imei-3, retrying (16 attempts left). Error: NOT_LEADER_FOR_PARTITION
>> >>> > >> 2016-08-21 10:03:07 [WARN ](o.a.k.c.p.i.Sender :257) Got error produce response with correlation id 490346 on topic-partition test3_a2_mobileDictClient_android_uid_imei-4, retrying (17 attempts left). Error: NOT_LEADER_FOR_PARTITION
>> >>> > >> 2016-08-21 10:03:07 [WARN ](o.a.k.c.p.i.Sender :257) Got error produce response with correlation id 490346 on topic-partition test3_a2_mobileDictClient_android_uid_imei-6, retrying (17 attempts left). Error: NOT_LEADER_FOR_PARTITION
>> >>> > >>
>> >>> > >> ...
>> >>> > >>
>> >>> > >> 2016-08-21 10:49:01 [WARN ](o.a.s.s.k.KafkaSystemProducer :66 ) Retrying send messsage due to RetriableException - org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition.. Turn on debugging to get a full stack trace
>> >>> > >> 2016-08-21 10:49:11 [WARN ](o.a.s.s.k.KafkaSystemProducer :66 ) Retrying send messsage due to RetriableException - org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition.. Turn on debugging to get a full stack trace
>> >>> > >> 2016-08-21 10:49:21 [WARN ](o.a.s.s.k.KafkaSystemProducer :66 ) Retrying send messsage due to RetriableException - org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition.. Turn on debugging to get a full stack trace
>> >>> > >> 2016-08-21 10:49:31 [WARN ](o.a.s.s.k.KafkaSystemProducer :66 ) Retrying send messsage due to RetriableException - org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition.. Turn on debugging to get a full stack trace
>> >>> > >> 2
>> >>> > >>
>> >>> > >> This started during the "rush hour" for new messages produced to Kafka.
>> >>> > >> Maybe this is a bug in Kafka / Samza?
>> >>> > >>
>> >>> > >> Kafka version: 0.10.0.0
>> >>> > >>
>> >>> > >> The Kafka config and part of the log from the paused container are attached.
>> >>> > >>
>> >>> > >>
>> >>> > >>
>> >>> > >
>> >>> > >
>> >>> > > --
>> >>> > > 李斯宁
>> >>> > >
>> >>> >
>> >>> >
>> >>> >
>> >>> > --
>> >>> > 李斯宁
>> >>> >
>> >>>
>> >>
>> >>
>> >>
>> >> --
>> >> 李斯宁
>> >>
>> >
>> >
>> >
>> > --
>> > 李斯宁
>> >
>>
>>
>>
>> --
>> Thanks and regards
>>
>> Chinmay Soman
>>
>
>
>
> --
> 李斯宁
>



-- 
李斯宁
