samza-dev mailing list archives

From Yi Pan <nickpa...@gmail.com>
Subject Re: Samza container hang on exception
Date Tue, 06 Sep 2016 17:05:38 GMT
Hi, Sining,

I took a look at your log and stack traces and want to clarify two points:

1) Based on the log, it seems that your container actually exited rather than
hanging, which is the expected behavior from 0.10.1 (retry X times and
error out in the SamzaContainer RunLoop).
2) The Kafka producer client keeps getting a "REQUEST_TIMEOUT" exception from
the send call. This is typically the case when your Kafka cluster is
overwhelmed. There are some known issues in Kafka broker 0.8.2 that cause
the producer to get stuck (KAFKA-1788). We did not get the full stack trace
from the Kafka producer client library from your run, but I suspect that
might be the issue if you are running Kafka broker 0.8.2. I would recommend
increasing your Kafka footprint and moving the broker VIP to a less-loaded
host to see whether the problem goes away.
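
To illustrate the "retry X times and error out" behavior mentioned in point 1,
here is a minimal Java sketch of a bounded retry loop with exponential backoff.
It is not Samza's actual code: the class name BoundedRetry, the method
callWithRetries, the exception type, and all parameters are hypothetical and
chosen only for this example.

```java
import java.util.concurrent.Callable;

/** Illustrative sketch only; not the Samza implementation. */
final class BoundedRetry {

    /** Hypothetical exception thrown once every attempt has failed. */
    static final class RetriesExhaustedException extends RuntimeException {
        RetriesExhaustedException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /**
     * Runs the action up to maxAttempts times. After each failure except the
     * last, sleeps for the current delay and then doubles it, capped at
     * maxDelayMs. If every attempt fails, throws instead of looping forever.
     */
    static <T> T callWithRetries(Callable<T> action, int maxAttempts,
                                 long initialDelayMs, long maxDelayMs) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long delayMs = initialDelayMs;
        Exception lastFailure = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.call();
            } catch (Exception e) {
                lastFailure = e;
                if (attempt == maxAttempts) {
                    break; // no sleep after the final failed attempt
                }
                try {
                    Thread.sleep(delayMs);
                } catch (InterruptedException ie) {
                    // Restore the interrupt flag and stop retrying.
                    Thread.currentThread().interrupt();
                    throw new RetriesExhaustedException("interrupted during backoff", ie);
                }
                delayMs = Math.min(delayMs * 2, maxDelayMs);
            }
        }
        throw new RetriesExhaustedException(
                "giving up after " + maxAttempts + " attempts", lastFailure);
    }
}
```

If an exception like this is allowed to propagate out of the processing loop,
the process can exit and be restarted by whatever supervises it, instead of
sleeping and retrying indefinitely as in the earlier jstack output.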

Let me know if we can be more helpful.

Thanks!

-Yi

On Fri, Sep 2, 2016 at 2:17 AM, 李斯宁 <lisining@gmail.com> wrote:

> yes, upgraded to 0.10.1
>
> jstack:
> https://drive.google.com/open?id=0B19olQZ1dUO8VjltQmtxLTJ4SVdFZWhYWHZ3Y2hMOVhCMWNn
> task log:
> https://drive.google.com/open?id=0B19olQZ1dUO8eVRLWmJCVl9nRlg2UUM4c21udUViWW8tSUVV
>
> On Fri, Sep 2, 2016 at 4:41 PM, Yi Pan <nickpan47@gmail.com> wrote:
>
> > Hi, Sining,
> >
> > Your note is on a site that I don't have an account for, and it requires
> > sign-up. Can you share it via Google Docs, since you have a Gmail account?
> > And just to confirm, you have upgraded and are using 0.10.1 now, right?
> >
> > Thanks, and apologies for the delay.
> >
> > -Yi
> >
> > On Fri, Sep 2, 2016 at 1:03 AM, 李斯宁 <lisining@gmail.com> wrote:
> >
> > > Can anyone help with this? Thanks!
> > >
> > > On Thu, Sep 1, 2016 at 11:59 AM, 李斯宁 <lisining@gmail.com> wrote:
> > >
> > > > If you cannot see the attachment, please try
> > > > http://note.youdao.com/noteshare?id=56b826c24af47a9fdb600490ce788710
> > > >
> > > > On Thu, Sep 1, 2016 at 1:50 AM, Chinmay Soman <chinmay.cerebro@gmail.com> wrote:
> > > >
> > > >> Sorry, I don't see anything in the attachment. Can you please re-attach
> > > >> and re-send?
> > > >>
> > > > >> On Wed, Aug 31, 2016 at 3:27 AM, 李斯宁 <lisining@gmail.com> wrote:
> > > > >>
> > > > >> > It seems upgrading does not solve the problem. All tasks hung in
> > > > >> > today's "rush hour".
> > > > >> > I attached the log and jstack output.
> > > > >> >
> > > > >> > SAMZA-911 is meant to fix this by stopping the process if it fails
> > > > >> > too many times. But the process is still there and hanging.
> > > > >> >
> > > > >> > On Mon, Aug 22, 2016 at 1:14 PM, 李斯宁 <lisining@gmail.com> wrote:
> > > >> >
> > > >> >> Thanks so much, I'll try.
> > > >> >>
> > > > >> >> On Mon, Aug 22, 2016 at 6:26 AM, Yi Pan <nickpan47@gmail.com> wrote:
> > > > >> >>
> > > > >> >>> Hi, Sining,
> > > > >> >>>
> > > > >> >>> This is a known bug that is fixed in 0.10.1 (SAMZA-911). Please try to
> > > > >> >>> upgrade to 0.10.1.
> > > >> >>>
> > > >> >>> Thanks!
> > > >> >>>
> > > >> >>> -Yi
> > > >> >>>
> > > > >> >>> On Sun, Aug 21, 2016 at 5:55 AM, 李斯宁 <lisining@gmail.com> wrote:
> > > > >> >>>
> > > > >> >>> > I have tried restarting every Kafka server. The container did not recover.
> > > > >> >>> >
> > > > >> >>> > The log contains the following:
> > > >> >>> >
> > > >> >>> > 2016-08-21 20:08:21 [WARN ](o.a.s.s.k.KafkaSystemProducer
> > > :66
> > > >> )
> > > >> >>> > Retrying send messsage due to RetriableException
-
> > > >> >>> > org.apache.kafka.common.errors.NotLeaderForPartitionException
> :
> > > This
> > > >> >>> server
> > > >> >>> > is not the leader for that topic-partition.. Turn
on debugging
> > to
> > > >> get a
> > > >> >>> > full stack trace
> > > >> >>> > 2016-08-21 20:08:22 [WARN ](o.a.k.c.p.i.Sender
> > >  :257)
> > > >> >>> Got
> > > >> >>> > error produce response with correlation id 4364
on
> > topic-partition
> > > >> >>> > samzaMetrics-5, retrying (0 attempts left). Error:
> > > >> >>> NOT_LEADER_FOR_PARTITION
> > > >> >>> > 2016-08-21 20:08:23 [WARN ](o.a.k.c.p.i.Sender
> > >  :257)
> > > >> >>> Got
> > > >> >>> > error produce response with correlation id 4367
on
> > topic-partition
> > > >> >>> > samzaMetrics-5, retrying (29 attempts left). Error:
> > > >> >>> > NOT_LEADER_FOR_PARTITION
> > > >> >>> >
> > > >> >>> >
> > > >> >>> > jstack shows:
> > > >> >>> >
> > > >> >>> > "main" #1 prio=5 os_prio=0 tid=0x00007f1ba401a000
nid=0x1a621
> > > >> waiting
> > > >> >>> on
> > > >> >>> > condition [0x00007f1bab976000]
> > > >> >>> > java.lang.Thread.State: TIMED_WAITING (sleeping)
> > > >> >>> > at java.lang.Thread.sleep(Native Method)
> > > >> >>> > at
> > > >> >>> > org.apache.samza.util.ExponentialSleepStrategy$RetryLoopStat
> > > >> e.sleep(
> > > >> >>> > ExponentialSleepStrategy.scala:105)
> > > >> >>> > at
> > > >> >>> > org.apache.samza.util.ExponentialSleepStrategy.run(
> > > >> >>> > ExponentialSleepStrategy.scala:91)
> > > >> >>> > at
> > > >> >>> > org.apache.samza.system.kafka.KafkaSystemProducer.send(
> > > >> >>> > KafkaSystemProducer.scala:91)
> > > >> >>> > at org.apache.samza.system.SystemProducers.send(
> SystemProducers
> > > >> >>> .scala:87)
> > > >> >>> > at
> > > >> >>> > org.apache.samza.task.TaskInstanceCollector.send(
> > > >> >>> > TaskInstanceCollector.scala:61)
> > > >> >>> > at toolbox.analyzer2.realtime.CommonWriter.write(
> CommonWriter.
> > > >> java:50)
> > > >> >>> > at toolbox.analyzer2.realtime.InitTask.lambda$process$0(
> InitTas
> > > >> >>> k.java:110)
> > > >> >>> > at toolbox.analyzer2.realtime.InitTask$$Lambda$4/938405008.
> emit
> > > >> >>> (Unknown
> > > >> >>> > Source)
> > > >> >>> > at
> > > >> >>> > toolbox.analyzer2.util.core.TransToKvProcessor.process(
> > > >> >>> > TransToKvProcessor.java:146)
> > > >> >>> > at toolbox.analyzer2.realtime.InitTask$2.emit(InitTask.java:
> > 119)
> > > >> >>> > at toolbox.analyzer2.util.core.JsonExpander.expand(
> JsonExpander
> > > >> >>> .java:47)
> > > >> >>> > at toolbox.analyzer2.realtime.InitTask.process(InitTask.
> > java:128)
> > > >> >>> > at
> > > >> >>> > org.apache.samza.container.TaskInstance$$anonfun$process$
> > > >> >>> > 1.apply$mcV$sp(TaskInstance.scala:150)
> > > >> >>> > at
> > > >> >>> > org.apache.samza.container.TaskInstanceExceptionHandler.mayb
> > > >> eHandle(
> > > >> >>> > TaskInstanceExceptionHandler.scala:54)
> > > >> >>> > at org.apache.samza.container.TaskInstance.process(
> TaskInstance
> > > >> >>> .scala:149)
> > > >> >>> > at
> > > >> >>> > org.apache.samza.container.RunLoop$$anonfun$process$1$$
> > > >> >>> > anonfun$apply$mcVJ$sp$2.apply(RunLoop.scala:122)
> > > >> >>> > at
> > > >> >>> > org.apache.samza.container.RunLoop$$anonfun$process$1$$
> > > >> >>> > anonfun$apply$mcVJ$sp$2.apply(RunLoop.scala:119)
> > > >> >>> > at scala.collection.immutable.List.foreach(List.scala:318)
> > > >> >>> > at
> > > >> >>> > org.apache.samza.container.RunLoop$$anonfun$process$1.
> > > >> >>> > apply$mcVJ$sp(RunLoop.scala:118)
> > > >> >>> > at
> > > >> >>> > org.apache.samza.util.TimerUtils$class.
> > updateTimerAndGetDuration(
> > > >> >>> > TimerUtils.scala:51)
> > > >> >>> > at
> > > >> >>> > org.apache.samza.container.RunLoop.updateTimerAndGetDuration(
> > > >> >>> > RunLoop.scala:35)
> > > >> >>> > at org.apache.samza.container.RunLoop.process(RunLoop.scala:
> > 106)
> > > >> >>> > at org.apache.samza.container.RunLoop.run(RunLoop.scala:74)
> > > >> >>> > at org.apache.samza.container.SamzaContainer.run(
> SamzaContainer
> > > >> >>> .scala:553)
> > > >> >>>
> > > >> >>> > at
> > > >> >>> > org.apache.samza.container.SamzaContainer$.safeMain(
> > > >> >>> > SamzaContainer.scala:92)
> > > >> >>> > at org.apache.samza.container.SamzaContainer$.main(
> > > >> >>> > SamzaContainer.scala:66)
> > > >> >>> > at org.apache.samza.container.SamzaContainer.main(
> SamzaContaine
> > > >> >>> r.scala)
> > > >> >>> >
> > > > >> >>> > Maybe the partition leader changed during rush hour, and the
> > > > >> >>> > metrics-writing method does not recognize that and retries again and again?
> > > > >> >>> >
> > > > >> >>> > Any response is appreciated :)
> > > > >> >>> >
> > > > >> >>> > On Sun, Aug 21, 2016 at 8:00 PM, 李斯宁 <lisining@gmail.com> wrote:
> > > > >> >>> >
> > > > >> >>> > > The end of the container's log shows these messages:
> > > >> >>> > >
> > > >> >>> > > 2016-08-21 19:57:01 [WARN ](o.a.s.s.k.KafkaSystemProducer
> > > >> :66 )
> > > >> >>> > Retrying send messsage due to RetriableException
-
> > > >> >>> org.apache.kafka.common.
> > > >> >>> > errors.NotLeaderForPartitionException: This server
is not the
> > > >> leader
> > > >> >>> for
> > > >> >>> > that topic-partition.. Turn on debugging to get
a full stack
> > trace
> > > >> >>> > > 2016-08-21 19:57:11 [WARN ](o.a.s.s.k.KafkaSystemProducer
> > > >> :66 )
> > > >> >>> > Retrying send messsage due to RetriableException
-
> > > >> >>> org.apache.kafka.common.
> > > >> >>> > errors.NotLeaderForPartitionException: This server
is not the
> > > >> leader
> > > >> >>> for
> > > >> >>> > that topic-partition.. Turn on debugging to get
a full stack
> > trace
> > > >> >>> > > 2016-08-21 19:57:21 [WARN ](o.a.s.s.k.KafkaSystemProducer
> > > >> :66 )
> > > >> >>> > Retrying send messsage due to RetriableException
-
> > > >> >>> org.apache.kafka.common.
> > > >> >>> > errors.NotLeaderForPartitionException: This server
is not the
> > > >> leader
> > > >> >>> for
> > > >> >>> > that topic-partition.. Turn on debugging to get
a full stack
> > trace
> > > >> >>> > > 2016-08-21 19:57:31 [WARN ](o.a.s.s.k.KafkaSystemProducer
> > > >> :66 )
> > > >> >>> > Retrying send messsage due to RetriableException
-
> > > >> >>> org.apache.kafka.common.
> > > >> >>> > errors.NotLeaderForPartitionException: This server
is not the
> > > >> leader
> > > >> >>> for
> > > >> >>> > that topic-partition.. Turn on debugging to get
a full stack
> > trace
> > > >> >>> > > 2016-08-21 19:57:41 [WARN ](o.a.s.s.k.KafkaSystemProducer
> > > >> :66 )
> > > >> >>> > Retrying send messsage due to RetriableException
-
> > > >> >>> org.apache.kafka.common.
> > > >> >>> > errors.NotLeaderForPartitionException: This server
is not the
> > > >> leader
> > > >> >>> for
> > > >> >>> > that topic-partition.. Turn on debugging to get
a full stack
> > trace
> > > >> >>> > > 2016-08-21 19:57:51 [WARN ](o.a.s.s.k.KafkaSystemProducer
> > > >> :66 )
> > > >> >>> > Retrying send messsage due to RetriableException
-
> > > >> >>> org.apache.kafka.common.
> > > >> >>> > errors.NotLeaderForPartitionException: This server
is not the
> > > >> leader
> > > >> >>> for
> > > >> >>> > that topic-partition.. Turn on debugging to get
a full stack
> > trace
> > > >> >>> > > 2016-08-21 19:58:01 [WARN ](o.a.s.s.k.KafkaSystemProducer
> > > >> :66 )
> > > >> >>> > Retrying send messsage due to RetriableException
-
> > > >> >>> org.apache.kafka.common.
> > > >> >>> > errors.NotLeaderForPartitionException: This server
is not the
> > > >> leader
> > > >> >>> for
> > > >> >>> > that topic-partition.. Turn on debugging to get
a full stack
> > trace
> > > >> >>> > > 2016-08-21 19:58:11 [WARN ](o.a.s.s.k.KafkaSystemProducer
> > > >> :66 )
> > > >> >>> > Retrying send messsage due to RetriableException
-
> > > >> >>> org.apache.kafka.common.
> > > >> >>> > errors.NotLeaderForPartitionException: This server
is not the
> > > >> leader
> > > >> >>> for
> > > >> >>> > that topic-partition.. Turn on debugging to get
a full stack
> > trace
> > > >> >>> > > 2016-08-21 19:58:21 [WARN ](o.a.s.s.k.KafkaSystemProducer
> > > >> :66 )
> > > >> >>> > Retrying send messsage due to RetriableException
-
> > > >> >>> org.apache.kafka.common.
> > > >> >>> > errors.NotLeaderForPartitionException: This server
is not the
> > > >> leader
> > > >> >>> for
> > > >> >>> > that topic-partition.. Turn on debugging to get
a full stack
> > trace
> > > >> >>> > > 2016-08-21 19:58:31 [WARN ](o.a.s.s.k.KafkaSystemProducer
> > > >> :66 )
> > > >> >>> > Retrying send messsage due to RetriableException
-
> > > >> >>> org.apache.kafka.common.
> > > >> >>> > errors.NotLeaderForPartitionException: This server
is not the
> > > >> leader
> > > >> >>> for
> > > >> >>> > that topic-partition.. Turn on debugging to get
a full stack
> > trace
> > > >> >>> > > 2016-08-21 19:58:41 [WARN ](o.a.s.s.k.KafkaSystemProducer
> > > >> :66 )
> > > >> >>> > Retrying send messsage due to RetriableException
-
> > > >> >>> org.apache.kafka.common.
> > > >> >>> > errors.NotLeaderForPartitionException: This server
is not the
> > > >> leader
> > > >> >>> for
> > > >> >>> > that topic-partition.. Turn on debugging to get
a full stack
> > trace
> > > >> >>> > > 2016-08-21 19:58:51 [WARN ](o.a.s.s.k.KafkaSystemProducer
> > > >> :66 )
> > > >> >>> > Retrying send messsage due to RetriableException
-
> > > >> >>> org.apache.kafka.common.
> > > >> >>> > errors.NotLeaderForPartitionException: This server
is not the
> > > >> leader
> > > >> >>> for
> > > >> >>> > that topic-partition.. Turn on debugging to get
a full stack
> > trace
> > > >> >>> > > 2016-08-21 19:59:01 [WARN ](o.a.s.s.k.KafkaSystemProducer
> > > >> :66 )
> > > >> >>> > Retrying send messsage due to RetriableException
-
> > > >> >>> org.apache.kafka.common.
> > > >> >>> > errors.NotLeaderForPartitionException: This server
is not the
> > > >> leader
> > > >> >>> for
> > > >> >>> > that topic-partition.. Turn on debugging to get
a full stack
> > trace
> > > >> >>> > >
> > > >> >>> > >
> > > > >> >>> > > On Sun, Aug 21, 2016 at 7:38 PM, 李斯宁 <lisining@gmail.com> wrote:
> > > > >> >>> > >
> > > > >> >>> > >> Hi, guys,
> > > > >> >>> > >> I'm using Samza for real-time processing. After running for about 10 hours,
> > > > >> >>> > >> some containers paused and stopped processing.
> > > > >> >>> > >>
> > > > >> >>> > >> When I looked into the log, I found a lot of messages like these:
> > > >> >>> > >>
> > > >> >>> > >> 2016-08-21 10:03:07 [WARN ](o.a.k.c.p.i.Sender
> > > >>  :257)
> > > >> >>> > Got error produce response with correlation id 490345
on
> > > >> >>> topic-partition
> > > >> >>> > test3_a2_mobileDictClient_android_uid_imei-3, retrying
(17
> > > attempts
> > > >> >>> > left). Error: NOT_LEADER_FOR_PARTITION
> > > >> >>> > >> 2016-08-21 10:03:07 [WARN ](o.a.k.c.p.i.Sender
> > > >>  :257)
> > > >> >>> > Got error produce response with correlation id 490345
on
> > > >> >>> topic-partition
> > > >> >>> > test3_a2_mobileDictClient_android_uid_imei-4, retrying
(18
> > > attempts
> > > >> >>> > left). Error: NOT_LEADER_FOR_PARTITION
> > > >> >>> > >> 2016-08-21 10:03:07 [WARN ](o.a.k.c.p.i.Sender
> > > >>  :257)
> > > >> >>> > Got error produce response with correlation id 490345
on
> > > >> >>> topic-partition
> > > >> >>> > test3_a2_mobileDictClient_android_uid_imei-6, retrying
(18
> > > attempts
> > > >> >>> > left). Error: NOT_LEADER_FOR_PARTITION
> > > >> >>> > >> 2016-08-21 10:03:07 [WARN ](o.a.k.c.p.i.Sender
> > > >>  :257)
> > > >> >>> > Got error produce response with correlation id 490346
on
> > > >> >>> topic-partition
> > > >> >>> > test3_a2_mobileDictClient_android_uid_imei-3, retrying
(16
> > > attempts
> > > >> >>> > left). Error: NOT_LEADER_FOR_PARTITION
> > > >> >>> > >> 2016-08-21 10:03:07 [WARN ](o.a.k.c.p.i.Sender
> > > >>  :257)
> > > >> >>> > Got error produce response with correlation id 490346
on
> > > >> >>> topic-partition
> > > >> >>> > test3_a2_mobileDictClient_android_uid_imei-4, retrying
(17
> > > attempts
> > > >> >>> > left). Error: NOT_LEADER_FOR_PARTITION
> > > >> >>> > >> 2016-08-21 10:03:07 [WARN ](o.a.k.c.p.i.Sender
> > > >>  :257)
> > > >> >>> > Got error produce response with correlation id 490346
on
> > > >> >>> topic-partition
> > > >> >>> > test3_a2_mobileDictClient_android_uid_imei-6, retrying
(17
> > > attempts
> > > >> >>> > left). Error: NOT_LEADER_FOR_PARTITION
> > > >> >>> > >>
> > > >> >>> > >> ...
> > > >> >>> > >>
> > > >> >>> > >> 2016-08-21 10:49:01 [WARN ](o.a.s.s.k.KafkaSystemProducer
> > > >> :66
> > > >> >>> )
> > > >> >>> > Retrying send messsage due to RetriableException
-
> > > >> >>> org.apache.kafka.common.
> > > >> >>> > errors.NotLeaderForPartitionException: This server
is not the
> > > >> leader
> > > >> >>> for
> > > >> >>> > that topic-partition.. Turn on debugging to get
a full stack
> > trace
> > > >> >>> > >> 2016-08-21 10:49:11 [WARN ](o.a.s.s.k.KafkaSystemProducer
> > > >> :66
> > > >> >>> )
> > > >> >>> > Retrying send messsage due to RetriableException
-
> > > >> >>> org.apache.kafka.common.
> > > >> >>> > errors.NotLeaderForPartitionException: This server
is not the
> > > >> leader
> > > >> >>> for
> > > >> >>> > that topic-partition.. Turn on debugging to get
a full stack
> > trace
> > > >> >>> > >> 2016-08-21 10:49:21 [WARN ](o.a.s.s.k.KafkaSystemProducer
> > > >> :66
> > > >> >>> )
> > > >> >>> > Retrying send messsage due to RetriableException
-
> > > >> >>> org.apache.kafka.common.
> > > >> >>> > errors.NotLeaderForPartitionException: This server
is not the
> > > >> leader
> > > >> >>> for
> > > >> >>> > that topic-partition.. Turn on debugging to get
a full stack
> > trace
> > > >> >>> > >> 2016-08-21 10:49:31 [WARN ](o.a.s.s.k.KafkaSystemProducer
> > > >> :66
> > > >> >>> )
> > > >> >>> > Retrying send messsage due to RetriableException
-
> > > >> >>> org.apache.kafka.common.
> > > >> >>> > errors.NotLeaderForPartitionException: This server
is not the
> > > >> leader
> > > >> >>> for
> > > >> >>> > that topic-partition.. Turn on debugging to get
a full stack
> > trace
> > > >> >>> > >> 2
> > > >> >>> > >>
> > > > >> >>> > >> This has been happening since the "rush hour" of new messages produced
> > > > >> >>> > >> to Kafka. Maybe this is a bug in Kafka or Samza?
> > > > >> >>> > >>
> > > > >> >>> > >> Kafka version: 0.10.0.0
> > > > >> >>> > >>
> > > > >> >>> > >> The Kafka config and part of the log from the paused container are attached.
> > > >> >>> > >>
> > > >> >>> > >>
> > > >> >>> > >>
> > > >> >>> > >
> > > >> >>> > >
> > > >> >>> > > --
> > > >> >>> > > 李斯宁
> > > >> >>> > >
> > > >> >>> >
> > > >> >>> >
> > > >> >>> >
> > > >> >>> > --
> > > >> >>> > 李斯宁
> > > >> >>> >
> > > >> >>>
> > > >> >>
> > > >> >>
> > > >> >>
> > > >> >> --
> > > >> >> 李斯宁
> > > >> >>
> > > >> >
> > > >> >
> > > >> >
> > > >> > --
> > > >> > 李斯宁
> > > >> >
> > > >>
> > > >>
> > > >>
> > > >> --
> > > >> Thanks and regards
> > > >>
> > > >> Chinmay Soman
> > > >>
> > > >
> > > >
> > > >
> > > > --
> > > > 李斯宁
> > > >
> > >
> > >
> > >
> > > --
> > > 李斯宁
> > >
> >
>
>
>
> --
> 李斯宁
>
