flink-issues mailing list archives

From "Matthias (Jira)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-22566) Running Kerberized YARN application on Docker test (default input) fails with no resources
Date Fri, 07 May 2021 13:12:00 GMT

    [ https://issues.apache.org/jira/browse/FLINK-22566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17340817#comment-17340817 ]

Matthias commented on FLINK-22566:
----------------------------------

I had some discussion about it with [~fly_in_gis]. The NodeManager logs might have been helpful
in this case. The NodeManager is in charge of downloading the jars before actually starting
the TaskManagers. The NodeManager's logs are located on the worker nodes, which we haven't
accessed so far. I added commits so that these logs are captured as well.

The initial idea was to increase the timeout as well, but I haven't increased it for now. We
might want to understand the issue before increasing the timeout. It could be an infrastructure
problem, in which case increasing the timeout would make sense. I'm just afraid that it's
a different problem which we're not aware of right now. Increasing the timeout in that case
would just mask it. I'd rather run into the same problem again and then investigate the
NodeManager logs.

> Running Kerberized YARN application on Docker test (default input) fails with no resources
> ------------------------------------------------------------------------------------------
>
>                 Key: FLINK-22566
>                 URL: https://issues.apache.org/jira/browse/FLINK-22566
>             Project: Flink
>          Issue Type: Bug
>          Components: Deployment / YARN
>    Affects Versions: 1.13.0
>            Reporter: Dawid Wysakowicz
>            Assignee: Matthias
>            Priority: Blocker
>              Labels: test-stability
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=17558&view=logs&j=c88eea3b-64a0-564d-0031-9fdcd7b8abee&t=ff888d9b-cd34-53cc-d90f-3e446d355529&l=8745
> {code}
> May 05 01:29:04 Caused by: java.util.concurrent.TimeoutException: Timeout has occurred: 120000 ms
> May 05 01:29:04 	at org.apache.flink.runtime.jobmaster.slotpool.PhysicalSlotRequestBulkCheckerImpl.lambda$schedulePendingRequestBulkWithTimestampCheck$0(PhysicalSlotRequestBulkCheckerImpl.java:86) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
> May 05 01:29:04 	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[?:1.8.0_292]
> May 05 01:29:04 	at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[?:1.8.0_292]
> May 05 01:29:04 	at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:440) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
> May 05 01:29:04 	at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:208) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
> May 05 01:29:04 	at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:77) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
> May 05 01:29:04 	at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:158) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
> May 05 01:29:04 	at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
> May 05 01:29:04 	at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
> May 05 01:29:04 	at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
> May 05 01:29:04 	at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
> May 05 01:29:04 	at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
> May 05 01:29:04 	at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
> May 05 01:29:04 	at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
> May 05 01:29:04 	at akka.actor.Actor$class.aroundReceive(Actor.scala:517) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
> May 05 01:29:04 	at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
> May 05 01:29:04 	at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
> May 05 01:29:04 	at akka.actor.ActorCell.invoke(ActorCell.scala:561) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
> May 05 01:29:04 	at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
> May 05 01:29:04 	at akka.dispatch.Mailbox.run(Mailbox.scala:225) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
> May 05 01:29:04 	at akka.dispatch.Mailbox.exec(Mailbox.scala:235) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
> May 05 01:29:04 	... 4 more
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
