flink-user mailing list archives

From miki haiat <miko5...@gmail.com>
Subject Re: Temporary failure in name resolution
Date Mon, 09 Apr 2018 08:22:42 GMT
I attached the full logs from the JM and TM.

Memory and GC look fine,

not sure really what is causing the JM/TM to crash ...
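Following up on Fabian's suggestion downthread to watch the memory and GC metrics: a minimal way to expose them is a reporter in flink-conf.yaml. This is a sketch assuming Flink 1.4's JMX reporter keys (the port range here is an arbitrary example):

```yaml
# Expose JM/TM metrics (incl. Status.JVM.Memory.* and
# Status.JVM.GarbageCollector.*) over JMX for monitoring.
metrics.reporters: jmx
metrics.reporter.jmx.class: org.apache.flink.metrics.jmx.JMXReporter
# Optional: pin a JMX port range per TaskManager (example value).
metrics.reporter.jmx.port: 8789-8799
```

With this in place, the heap usage and per-collector GC count/time metrics referenced in Fabian's links [1][2] below should show up under the Status.JVM scope.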


flink-beam1-taskmanager-0-beam1.log
<https://drive.google.com/file/d/1qTOV_EeNr6FSVgUstVZtf1IcDOxnzn_l/view?usp=drive_web>
flink-beam2-taskmanager-4-beam2.log
<https://drive.google.com/file/d/1CXHYkqFRDQcsKrbEU4v1LNOvNJZuY-cO/view?usp=drive_web>



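One detail in the Evictor code quoted further down the thread looks suspect: LocalDateTime.getNano() returns only the nanosecond-of-second field (0-999999999), not a point in time, so the `<=` comparison there does not implement an "older than five minutes" cutoff. A self-contained sketch of that check using a full timestamp comparison instead (the class and method names here are hypothetical, not from the original job):

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

public class EvictionCutoff {

    // Returns true if the element's end date lies more than five minutes
    // before 'now'. Compares whole LocalDateTime values with isBefore()
    // rather than getNano(), which only yields the nano-of-second field.
    static boolean olderThanFiveMinutes(String endDate, LocalDateTime now) {
        LocalDateTime el = LocalDateTime.parse(endDate, DateTimeFormatter.ISO_DATE_TIME);
        return el.isBefore(now.minusMinutes(5));
    }

    public static void main(String[] args) {
        LocalDateTime now = LocalDateTime.of(2018, 4, 2, 13, 0, 0);
        System.out.println(olderThanFiveMinutes("2018-04-02T12:50:00", now)); // true
        System.out.println(olderThanFiveMinutes("2018-04-02T12:58:00", now)); // false
    }
}
```

Inside evictAfter, this predicate would replace the two getNano() calls; the iterator.remove() loop itself is the right way to drop elements from the window state.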
On Wed, Apr 4, 2018 at 4:30 PM, Fabian Hueske <fhueske@gmail.com> wrote:

> Hi,
>
> The issue might be related to garbage collection pauses during which the
> TM JVM cannot communicate with the JM.
> The metrics include stats on memory consumption [1] and GC activity [2]
> that can help diagnose the problem.
>
> Best, Fabian
>
> [1] https://ci.apache.org/projects/flink/flink-docs-release-1.4/monitoring/metrics.html#memory
> [2] https://ci.apache.org/projects/flink/flink-docs-release-1.4/monitoring/metrics.html#garbagecollection
>
> 2018-04-04 8:30 GMT+02:00 miki haiat <miko5054@gmail.com>:
>
>> Hi,
>>
>> I checked the code again to figure out where the problem could be.
>>
>> I just wondered if I'm implementing the Evictor correctly?
>>
>> full code
>> https://gist.github.com/miko-code/6d7010505c3cb95be122364b29057237
>>
>>
>>
>>
>> public static class EsbTraceEvictor implements Evictor<EsbTrace, GlobalWindow> {
>>     org.slf4j.Logger LOG = LoggerFactory.getLogger(EsbTraceEvictor.class);
>>     @Override
>>     public void evictBefore(Iterable<TimestampedValue<EsbTrace>> iterable, int i, GlobalWindow globalWindow, Evictor.EvictorContext evictorContext) {
>>
>>     }
>>
>>     @Override
>>     public void evictAfter(Iterable<TimestampedValue<EsbTrace>> elements, int i, GlobalWindow globalWindow, EvictorContext evictorContext) {
>>         //change it to current procces time
>>         long min5min = LocalDateTime.now().minusMinutes(5).getNano();
>>         LOG.info("time now -5min", min5min);
>>         DateTimeFormatter format = DateTimeFormatter.ISO_DATE_TIME;
>>         for (Iterator<TimestampedValue<EsbTrace>> iterator = elements.iterator(); iterator.hasNext(); ) {
>>             TimestampedValue<EsbTrace> element = iterator.next();
>>             LocalDateTime el = LocalDateTime.parse(element.getValue().getEndDate(), format);
>>             LOG.info("element time ", element.getValue().getEndDate());
>>             if (el.minusMinutes(5).getNano() <= min5min) {
>>                 iterator.remove();
>>             }
>>         }
>>     }
>> }
>>
>>
>>
>>
>>
>>
>> On Tue, Apr 3, 2018 at 4:28 PM, Hao Sun <hasun@zendesk.com> wrote:
>>
>>> Hi Timo, we have a similar issue: a TM got killed by a job. Is there a
>>> way to monitor JVM status? If through the monitoring metrics, which
>>> metrics should I look at?
>>> We are running Flink on K8s. Is it possible that a job consumes too
>>> much network bandwidth, so the JM and TM cannot connect?
>>>
>>> On Tue, Apr 3, 2018 at 3:11 AM Timo Walther <twalthr@apache.org> wrote:
>>>
>>>> Hi Miki,
>>>>
>>>> to me this sounds like your job has a resource leak, such that memory
>>>> fills up and the JVM of the TaskManager is killed at some point. What
>>>> does your job look like? I see a WindowedStream.apply, which might not be
>>>> appropriate if you have big/frequent windows where the evaluation happens
>>>> too late and the state becomes too big.
>>>>
>>>> Regards,
>>>> Timo
>>>>
>>>>
>>>> Am 03.04.18 um 08:26 schrieb miki haiat:
>>>>
>>>> I tried running Flink on Kubernetes and as a standalone HA cluster, and
>>>> in both cases the task manager got lost/killed after a few hours/days.
>>>> I'm using Ubuntu and Flink 1.4.2.
>>>>
>>>>
>>>> This is part of the log; I also attached the full log.
>>>>
>>>>>
>>>>> org.tlv.esb.StreamingJob$EsbTraceEvictor@20ffca60, WindowedStream.apply(WindowedStream.java:1061)) -> Sink: Unnamed (1/1) (91b27853aa30be93322d9c516ec266bf) switched from RUNNING to FAILED.
>>>>> java.lang.Exception: TaskManager was lost/killed: 6dc6cd5c15588b49da39a31b6480b2e3 @ beam2 (dataPort=42587)
>>>>> at org.apache.flink.runtime.instance.SimpleSlot.releaseSlot(SimpleSlot.java:217)
>>>>> at org.apache.flink.runtime.instance.SlotSharingGroupAssignment.releaseSharedSlot(SlotSharingGroupAssignment.java:523)
>>>>> at org.apache.flink.runtime.instance.SharedSlot.releaseSlot(SharedSlot.java:192)
>>>>> at org.apache.flink.runtime.instance.Instance.markDead(Instance.java:167)
>>>>> at org.apache.flink.runtime.instance.InstanceManager.unregisterTaskManager(InstanceManager.java:212)
>>>>> at org.apache.flink.runtime.jobmanager.JobManager.org$apache$flink$runtime$jobmanager$JobManager$$handleTaskManagerTerminated(JobManager.scala:1198)
>>>>> at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1.applyOrElse(JobManager.scala:1096)
>>>>> at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
>>>>> at org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:49)
>>>>> at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
>>>>> at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)
>>>>> at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)
>>>>> at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
>>>>> at org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)
>>>>> at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
>>>>> at org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:122)
>>>>> at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
>>>>> at akka.actor.dungeon.DeathWatch$class.receivedTerminated(DeathWatch.scala:46)
>>>>> at akka.actor.ActorCell.receivedTerminated(ActorCell.scala:374)
>>>>> at akka.actor.ActorCell.autoReceiveMessage(ActorCell.scala:511)
>>>>> at akka.actor.ActorCell.invoke(ActorCell.scala:494)
>>>>> at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
>>>>> at akka.dispatch.Mailbox.run(Mailbox.scala:224)
>>>>> at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
>>>>> at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>>>>> at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>>>>> at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>>>>> at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>>>>> 2018-04-02 13:09:01,727 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job Flink Streaming esb correlate msg (0db04ff29124f59a123d4743d89473ed) switched from state RUNNING to FAILING.
>>>>> java.lang.Exception: TaskManager was lost/killed: 6dc6cd5c15588b49da39a31b6480b2e3 @ beam2 (dataPort=42587)
>>>>> at org.apache.flink.runtime.instance.SimpleSlot.releaseSlot(SimpleSlot.java:217)
>>>>> at org.apache.flink.runtime.instance.SlotSharingGroupAssignment.releaseSharedSlot(SlotSharingGroupAssignment.java:523)
>>>>> at org.apache.flink.runtime.instance.SharedSlot.releaseSlot(SharedSlot.java:192)
>>>>> at org.apache.flink.runtime.instance.Instance.markDead(Instance.java:167)
>>>>> at org.apache.flink.runtime.instance.InstanceManager.unregisterTaskManager(InstanceManager.java:212)
>>>>> at org.apache.flink.runtime.jobmanager.JobManager.org$apache$flink$runtime$jobmanager$JobManager$$handleTaskManagerTerminated(JobManager.scala:1198)
>>>>> at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1.applyOrElse(JobManager.scala:1096)
>>>>> at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
>>>>> at org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:49)
>>>>> at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
>>>>> at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)
>>>>> at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)
>>>>> at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
>>>>> at org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)
>>>>> at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
>>>>> at org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:122)
>>>>> at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
>>>>> at akka.actor.dungeon.DeathWatch$class.receivedTerminated(DeathWatch.scala:46)
>>>>> at akka.actor.ActorCell.receivedTerminated(ActorCell.scala:374)
>>>>> at akka.actor.ActorCell.autoReceiveMessage(ActorCell.scala:511)
>>>>> at akka.actor.ActorCell.invoke(ActorCell.scala:494)
>>>>> at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
>>>>> at akka.dispatch.Mailbox.run(Mailbox.scala:224)
>>>>> at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
>>>>> at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>>>>> at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>>>>> at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>>>>> at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>>>>> 2018-04-02 13:09:01,737 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: Custom Source (1/1) (a10c25c2d3de57d33828524938fcfcc2) switched from RUNNING to CANCELING.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>
>
