samza-dev mailing list archives

From 李斯宁 <lisin...@gmail.com>
Subject Re: Samza container hang on exception
Date Fri, 02 Sep 2016 09:17:26 GMT
Yes, I have upgraded to 0.10.1.

jstack:
https://drive.google.com/open?id=0B19olQZ1dUO8VjltQmtxLTJ4SVdFZWhYWHZ3Y2hMOVhCMWNn
task log:
https://drive.google.com/open?id=0B19olQZ1dUO8eVRLWmJCVl9nRlg2UUM4c21udUViWW8tSUVV

On Fri, Sep 2, 2016 at 4:41 PM, Yi Pan <nickpan47@gmail.com> wrote:

> Hi, Sining,
>
> Your note is on a site where I don't have an account or access, and it requires
> sign-up. Can you share it via Google Docs, since you have a Gmail account?
> And just to confirm, you have upgraded and are using 0.10.1 now, right?
>
> Thanks, and apologies for the delay.
>
> -Yi
>
> On Fri, Sep 2, 2016 at 1:03 AM, 李斯宁 <lisining@gmail.com> wrote:
>
> > Can anyone help with this? Thanks!
> >
> > On Thu, Sep 1, 2016 at 11:59 AM, 李斯宁 <lisining@gmail.com> wrote:
> >
> > > > If you cannot see the attachment, please try
> > > > http://note.youdao.com/noteshare?id=56b826c24af47a9fdb600490ce788710
> > >
> > > > On Thu, Sep 1, 2016 at 1:50 AM, Chinmay Soman <chinmay.cerebro@gmail.com> wrote:
> > >
> > > >> Sorry, I don't see anything in the attachment. Can you please re-attach and
> > > >> re-send?
> > >>
> > >> On Wed, Aug 31, 2016 at 3:27 AM, 李斯宁 <lisining@gmail.com> wrote:
> > >>
> > > >> > It seems upgrading does not solve the problem. All tasks hung during today's
> > > >> > "rush hour".
> > > >> > I attached the log and the jstack output.
> > > >> >
> > > >> > SAMZA-911 tries to fix this by stopping the process if it fails too many
> > > >> > times. But the process is still there and hanging.
> > >> >
> > > >> > On Mon, Aug 22, 2016 at 1:14 PM, 李斯宁 <lisining@gmail.com> wrote:
> > >> >
> > >> >> Thanks so much, I'll try.
> > >> >>
> > > >> >> On Mon, Aug 22, 2016 at 6:26 AM, Yi Pan <nickpan47@gmail.com> wrote:
> > >> >>
> > >> >>> Hi, Sining,
> > >> >>>
> > > >> >>> This is a known bug that is fixed in 0.10.1 (SAMZA-911). Please try to
> > > >> >>> upgrade to 0.10.1.
> > >> >>>
> > >> >>> Thanks!
> > >> >>>
> > >> >>> -Yi
> > >> >>>
> > > >> >>> On Sun, Aug 21, 2016 at 5:55 AM, 李斯宁 <lisining@gmail.com> wrote:
> > >> >>>
> > > >> >>> > I have tried restarting every Kafka server. The container did not recover.
> > > >> >>> >
> > > >> >>> > The log contains the following:
> > >> >>> >
> > >> >>> > 2016-08-21 20:08:21 [WARN ](o.a.s.s.k.KafkaSystemProducer
> > :66
> > >> )
> > >> >>> > Retrying send messsage due to RetriableException -
> > >> >>> > org.apache.kafka.common.errors.NotLeaderForPartitionException:
> > This
> > >> >>> server
> > >> >>> > is not the leader for that topic-partition.. Turn on
debugging
> to
> > >> get a
> > >> >>> > full stack trace
> > >> >>> > 2016-08-21 20:08:22 [WARN ](o.a.k.c.p.i.Sender
> >  :257)
> > >> >>> Got
> > >> >>> > error produce response with correlation id 4364 on
> topic-partition
> > >> >>> > samzaMetrics-5, retrying (0 attempts left). Error:
> > >> >>> NOT_LEADER_FOR_PARTITION
> > >> >>> > 2016-08-21 20:08:23 [WARN ](o.a.k.c.p.i.Sender
> >  :257)
> > >> >>> Got
> > >> >>> > error produce response with correlation id 4367 on
> topic-partition
> > >> >>> > samzaMetrics-5, retrying (29 attempts left). Error:
> > >> >>> > NOT_LEADER_FOR_PARTITION
> > >> >>> >
> > >> >>> >
> > > >> >>> > jstack shows:
> > > >> >>> >
> > > >> >>> > "main" #1 prio=5 os_prio=0 tid=0x00007f1ba401a000 nid=0x1a621 waiting on condition [0x00007f1bab976000]
> > > >> >>> > java.lang.Thread.State: TIMED_WAITING (sleeping)
> > > >> >>> > at java.lang.Thread.sleep(Native Method)
> > > >> >>> > at org.apache.samza.util.ExponentialSleepStrategy$RetryLoopState.sleep(ExponentialSleepStrategy.scala:105)
> > > >> >>> > at org.apache.samza.util.ExponentialSleepStrategy.run(ExponentialSleepStrategy.scala:91)
> > > >> >>> > at org.apache.samza.system.kafka.KafkaSystemProducer.send(KafkaSystemProducer.scala:91)
> > > >> >>> > at org.apache.samza.system.SystemProducers.send(SystemProducers.scala:87)
> > > >> >>> > at org.apache.samza.task.TaskInstanceCollector.send(TaskInstanceCollector.scala:61)
> > > >> >>> > at toolbox.analyzer2.realtime.CommonWriter.write(CommonWriter.java:50)
> > > >> >>> > at toolbox.analyzer2.realtime.InitTask.lambda$process$0(InitTask.java:110)
> > > >> >>> > at toolbox.analyzer2.realtime.InitTask$$Lambda$4/938405008.emit(Unknown Source)
> > > >> >>> > at toolbox.analyzer2.util.core.TransToKvProcessor.process(TransToKvProcessor.java:146)
> > > >> >>> > at toolbox.analyzer2.realtime.InitTask$2.emit(InitTask.java:119)
> > > >> >>> > at toolbox.analyzer2.util.core.JsonExpander.expand(JsonExpander.java:47)
> > > >> >>> > at toolbox.analyzer2.realtime.InitTask.process(InitTask.java:128)
> > > >> >>> > at org.apache.samza.container.TaskInstance$$anonfun$process$1.apply$mcV$sp(TaskInstance.scala:150)
> > > >> >>> > at org.apache.samza.container.TaskInstanceExceptionHandler.maybeHandle(TaskInstanceExceptionHandler.scala:54)
> > > >> >>> > at org.apache.samza.container.TaskInstance.process(TaskInstance.scala:149)
> > > >> >>> > at org.apache.samza.container.RunLoop$$anonfun$process$1$$anonfun$apply$mcVJ$sp$2.apply(RunLoop.scala:122)
> > > >> >>> > at org.apache.samza.container.RunLoop$$anonfun$process$1$$anonfun$apply$mcVJ$sp$2.apply(RunLoop.scala:119)
> > > >> >>> > at scala.collection.immutable.List.foreach(List.scala:318)
> > > >> >>> > at org.apache.samza.container.RunLoop$$anonfun$process$1.apply$mcVJ$sp(RunLoop.scala:118)
> > > >> >>> > at org.apache.samza.util.TimerUtils$class.updateTimerAndGetDuration(TimerUtils.scala:51)
> > > >> >>> > at org.apache.samza.container.RunLoop.updateTimerAndGetDuration(RunLoop.scala:35)
> > > >> >>> > at org.apache.samza.container.RunLoop.process(RunLoop.scala:106)
> > > >> >>> > at org.apache.samza.container.RunLoop.run(RunLoop.scala:74)
> > > >> >>> > at org.apache.samza.container.SamzaContainer.run(SamzaContainer.scala:553)
> > > >> >>> > at org.apache.samza.container.SamzaContainer$.safeMain(SamzaContainer.scala:92)
> > > >> >>> > at org.apache.samza.container.SamzaContainer$.main(SamzaContainer.scala:66)
> > > >> >>> > at org.apache.samza.container.SamzaContainer.main(SamzaContainer.scala)
> > >> >>> >
> > > >> >>> > Maybe the partition leader changed during rush hour, and the metrics-writing
> > > >> >>> > method does not recognize that and retries again and again?
> > >> >>> >
> > >> >>> > Any response is appreciated :)
> > >> >>> >
> > > >> >>> > On Sun, Aug 21, 2016 at 8:00 PM, 李斯宁 <lisining@gmail.com> wrote:
> > >> >>> >
> > > >> >>> > > At the end of the container's log, these lines are printed:
> > >> >>> > >
> > >> >>> > > 2016-08-21 19:57:01 [WARN ](o.a.s.s.k.KafkaSystemProducer
> > >> :66 )
> > >> >>> > Retrying send messsage due to RetriableException -
> > >> >>> org.apache.kafka.common.
> > >> >>> > errors.NotLeaderForPartitionException: This server is
not the
> > >> leader
> > >> >>> for
> > >> >>> > that topic-partition.. Turn on debugging to get a full
stack
> trace
> > >> >>> > > 2016-08-21 19:57:11 [WARN ](o.a.s.s.k.KafkaSystemProducer
> > >> :66 )
> > >> >>> > Retrying send messsage due to RetriableException -
> > >> >>> org.apache.kafka.common.
> > >> >>> > errors.NotLeaderForPartitionException: This server is
not the
> > >> leader
> > >> >>> for
> > >> >>> > that topic-partition.. Turn on debugging to get a full
stack
> trace
> > >> >>> > > 2016-08-21 19:57:21 [WARN ](o.a.s.s.k.KafkaSystemProducer
> > >> :66 )
> > >> >>> > Retrying send messsage due to RetriableException -
> > >> >>> org.apache.kafka.common.
> > >> >>> > errors.NotLeaderForPartitionException: This server is
not the
> > >> leader
> > >> >>> for
> > >> >>> > that topic-partition.. Turn on debugging to get a full
stack
> trace
> > >> >>> > > 2016-08-21 19:57:31 [WARN ](o.a.s.s.k.KafkaSystemProducer
> > >> :66 )
> > >> >>> > Retrying send messsage due to RetriableException -
> > >> >>> org.apache.kafka.common.
> > >> >>> > errors.NotLeaderForPartitionException: This server is
not the
> > >> leader
> > >> >>> for
> > >> >>> > that topic-partition.. Turn on debugging to get a full
stack
> trace
> > >> >>> > > 2016-08-21 19:57:41 [WARN ](o.a.s.s.k.KafkaSystemProducer
> > >> :66 )
> > >> >>> > Retrying send messsage due to RetriableException -
> > >> >>> org.apache.kafka.common.
> > >> >>> > errors.NotLeaderForPartitionException: This server is
not the
> > >> leader
> > >> >>> for
> > >> >>> > that topic-partition.. Turn on debugging to get a full
stack
> trace
> > >> >>> > > 2016-08-21 19:57:51 [WARN ](o.a.s.s.k.KafkaSystemProducer
> > >> :66 )
> > >> >>> > Retrying send messsage due to RetriableException -
> > >> >>> org.apache.kafka.common.
> > >> >>> > errors.NotLeaderForPartitionException: This server is
not the
> > >> leader
> > >> >>> for
> > >> >>> > that topic-partition.. Turn on debugging to get a full
stack
> trace
> > >> >>> > > 2016-08-21 19:58:01 [WARN ](o.a.s.s.k.KafkaSystemProducer
> > >> :66 )
> > >> >>> > Retrying send messsage due to RetriableException -
> > >> >>> org.apache.kafka.common.
> > >> >>> > errors.NotLeaderForPartitionException: This server is
not the
> > >> leader
> > >> >>> for
> > >> >>> > that topic-partition.. Turn on debugging to get a full
stack
> trace
> > >> >>> > > 2016-08-21 19:58:11 [WARN ](o.a.s.s.k.KafkaSystemProducer
> > >> :66 )
> > >> >>> > Retrying send messsage due to RetriableException -
> > >> >>> org.apache.kafka.common.
> > >> >>> > errors.NotLeaderForPartitionException: This server is
not the
> > >> leader
> > >> >>> for
> > >> >>> > that topic-partition.. Turn on debugging to get a full
stack
> trace
> > >> >>> > > 2016-08-21 19:58:21 [WARN ](o.a.s.s.k.KafkaSystemProducer
> > >> :66 )
> > >> >>> > Retrying send messsage due to RetriableException -
> > >> >>> org.apache.kafka.common.
> > >> >>> > errors.NotLeaderForPartitionException: This server is
not the
> > >> leader
> > >> >>> for
> > >> >>> > that topic-partition.. Turn on debugging to get a full
stack
> trace
> > >> >>> > > 2016-08-21 19:58:31 [WARN ](o.a.s.s.k.KafkaSystemProducer
> > >> :66 )
> > >> >>> > Retrying send messsage due to RetriableException -
> > >> >>> org.apache.kafka.common.
> > >> >>> > errors.NotLeaderForPartitionException: This server is
not the
> > >> leader
> > >> >>> for
> > >> >>> > that topic-partition.. Turn on debugging to get a full
stack
> trace
> > >> >>> > > 2016-08-21 19:58:41 [WARN ](o.a.s.s.k.KafkaSystemProducer
> > >> :66 )
> > >> >>> > Retrying send messsage due to RetriableException -
> > >> >>> org.apache.kafka.common.
> > >> >>> > errors.NotLeaderForPartitionException: This server is
not the
> > >> leader
> > >> >>> for
> > >> >>> > that topic-partition.. Turn on debugging to get a full
stack
> trace
> > >> >>> > > 2016-08-21 19:58:51 [WARN ](o.a.s.s.k.KafkaSystemProducer
> > >> :66 )
> > >> >>> > Retrying send messsage due to RetriableException -
> > >> >>> org.apache.kafka.common.
> > >> >>> > errors.NotLeaderForPartitionException: This server is
not the
> > >> leader
> > >> >>> for
> > >> >>> > that topic-partition.. Turn on debugging to get a full
stack
> trace
> > >> >>> > > 2016-08-21 19:59:01 [WARN ](o.a.s.s.k.KafkaSystemProducer
> > >> :66 )
> > >> >>> > Retrying send messsage due to RetriableException -
> > >> >>> org.apache.kafka.common.
> > >> >>> > errors.NotLeaderForPartitionException: This server is
not the
> > >> leader
> > >> >>> for
> > >> >>> > that topic-partition.. Turn on debugging to get a full
stack
> trace
> > >> >>> > >
> > >> >>> > >
> > > >> >>> > > On Sun, Aug 21, 2016 at 7:38 PM, 李斯宁 <lisining@gmail.com> wrote:
> > >> >>> > >
> > > >> >>> > >> Hi guys,
> > > >> >>> > >> I'm using Samza for real-time processing. After running for about 10 hours,
> > > >> >>> > >> some containers paused and stopped processing.
> > > >> >>> > >>
> > > >> >>> > >> When I looked into the log, I found many entries like these:
> > >> >>> > >>
> > >> >>> > >> 2016-08-21 10:03:07 [WARN ](o.a.k.c.p.i.Sender
> > >>  :257)
> > >> >>> > Got error produce response with correlation id 490345
on
> > >> >>> topic-partition
> > >> >>> > test3_a2_mobileDictClient_android_uid_imei-3, retrying
(17
> > attempts
> > >> >>> > left). Error: NOT_LEADER_FOR_PARTITION
> > >> >>> > >> 2016-08-21 10:03:07 [WARN ](o.a.k.c.p.i.Sender
> > >>  :257)
> > >> >>> > Got error produce response with correlation id 490345
on
> > >> >>> topic-partition
> > >> >>> > test3_a2_mobileDictClient_android_uid_imei-4, retrying
(18
> > attempts
> > >> >>> > left). Error: NOT_LEADER_FOR_PARTITION
> > >> >>> > >> 2016-08-21 10:03:07 [WARN ](o.a.k.c.p.i.Sender
> > >>  :257)
> > >> >>> > Got error produce response with correlation id 490345
on
> > >> >>> topic-partition
> > >> >>> > test3_a2_mobileDictClient_android_uid_imei-6, retrying
(18
> > attempts
> > >> >>> > left). Error: NOT_LEADER_FOR_PARTITION
> > >> >>> > >> 2016-08-21 10:03:07 [WARN ](o.a.k.c.p.i.Sender
> > >>  :257)
> > >> >>> > Got error produce response with correlation id 490346
on
> > >> >>> topic-partition
> > >> >>> > test3_a2_mobileDictClient_android_uid_imei-3, retrying
(16
> > attempts
> > >> >>> > left). Error: NOT_LEADER_FOR_PARTITION
> > >> >>> > >> 2016-08-21 10:03:07 [WARN ](o.a.k.c.p.i.Sender
> > >>  :257)
> > >> >>> > Got error produce response with correlation id 490346
on
> > >> >>> topic-partition
> > >> >>> > test3_a2_mobileDictClient_android_uid_imei-4, retrying
(17
> > attempts
> > >> >>> > left). Error: NOT_LEADER_FOR_PARTITION
> > >> >>> > >> 2016-08-21 10:03:07 [WARN ](o.a.k.c.p.i.Sender
> > >>  :257)
> > >> >>> > Got error produce response with correlation id 490346
on
> > >> >>> topic-partition
> > >> >>> > test3_a2_mobileDictClient_android_uid_imei-6, retrying
(17
> > attempts
> > >> >>> > left). Error: NOT_LEADER_FOR_PARTITION
> > >> >>> > >>
> > >> >>> > >> ...
> > >> >>> > >>
> > >> >>> > >> 2016-08-21 10:49:01 [WARN ](o.a.s.s.k.KafkaSystemProducer
> > >> :66
> > >> >>> )
> > >> >>> > Retrying send messsage due to RetriableException -
> > >> >>> org.apache.kafka.common.
> > >> >>> > errors.NotLeaderForPartitionException: This server is
not the
> > >> leader
> > >> >>> for
> > >> >>> > that topic-partition.. Turn on debugging to get a full
stack
> trace
> > >> >>> > >> 2016-08-21 10:49:11 [WARN ](o.a.s.s.k.KafkaSystemProducer
> > >> :66
> > >> >>> )
> > >> >>> > Retrying send messsage due to RetriableException -
> > >> >>> org.apache.kafka.common.
> > >> >>> > errors.NotLeaderForPartitionException: This server is
not the
> > >> leader
> > >> >>> for
> > >> >>> > that topic-partition.. Turn on debugging to get a full
stack
> trace
> > >> >>> > >> 2016-08-21 10:49:21 [WARN ](o.a.s.s.k.KafkaSystemProducer
> > >> :66
> > >> >>> )
> > >> >>> > Retrying send messsage due to RetriableException -
> > >> >>> org.apache.kafka.common.
> > >> >>> > errors.NotLeaderForPartitionException: This server is
not the
> > >> leader
> > >> >>> for
> > >> >>> > that topic-partition.. Turn on debugging to get a full
stack
> trace
> > >> >>> > >> 2016-08-21 10:49:31 [WARN ](o.a.s.s.k.KafkaSystemProducer
> > >> :66
> > >> >>> )
> > >> >>> > Retrying send messsage due to RetriableException -
> > >> >>> org.apache.kafka.common.
> > >> >>> > errors.NotLeaderForPartitionException: This server is
not the
> > >> leader
> > >> >>> for
> > >> >>> > that topic-partition.. Turn on debugging to get a full
stack
> trace
> > >> >>> > >> 2
> > >> >>> > >>
> > > >> >>> > >> This has been happening since the "rush hour" for new messages produced to Kafka.
> > > >> >>> > >> Maybe this is a bug in Kafka or Samza?
> > >> >>> > >>
> > >> >>> > >> kafka version: 0.10.0.0
> > >> >>> > >>
> > > >> >>> > >> The Kafka config and part of the log from the paused container are attached.
> > >> >>> > >>
> > >> >>> > >>
> > >> >>> > >>
> > >> >>> > >
> > >> >>> > >
> > >> >>> > > --
> > >> >>> > > 李斯宁
> > >> >>> > >
> > >> >>> >
> > >> >>> >
> > >> >>> >
> > >> >>> > --
> > >> >>> > 李斯宁
> > >> >>> >
> > >> >>>
> > >> >>
> > >> >>
> > >> >>
> > >> >> --
> > >> >> 李斯宁
> > >> >>
> > >> >
> > >> >
> > >> >
> > >> > --
> > >> > 李斯宁
> > >> >
> > >>
> > >>
> > >>
> > >> --
> > >> Thanks and regards
> > >>
> > >> Chinmay Soman
> > >>
> > >
> > >
> > >
> > > --
> > > 李斯宁
> > >
> >
> >
> >
> > --
> > 李斯宁
> >
>



-- 
李斯宁
