flink-issues mailing list archives

From "Gary Yao (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-12385) RestClusterClient can hang indefinitely during job submission
Date Fri, 10 May 2019 09:44:00 GMT

    [ https://issues.apache.org/jira/browse/FLINK-12385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16837106#comment-16837106
] 

Gary Yao commented on FLINK-12385:
----------------------------------

It is true that there are no timeouts when waiting for the {{jobSubmissionFuture}}, but there
are timeouts for the asynchronous operations that the future depends on:

https://ci.apache.org/projects/flink/flink-docs-release-1.8/ops/config.html#rest-await-leader-timeout
https://ci.apache.org/projects/flink/flink-docs-release-1.8/ops/config.html#rest-connection-timeout

The client also retries operations 20 times by default, which could explain why the client
seems to hang indefinitely:

https://ci.apache.org/projects/flink/flink-docs-release-1.8/ops/config.html#rest-retry-max-attempts
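
To illustrate how retries combined with per-attempt timeouts can add up to a long wait, here is a rough sketch of the arithmetic. It is not Flink code: the class {{RetryBudget}} is hypothetical, and the 15 s per-attempt timeout and 3 s retry delay are assumed values for illustration only. It also assumes that every attempt runs into its full timeout and that "20 retries" means 20 attempts after the initial one.

```java
import java.time.Duration;

// Hypothetical helper, not part of Flink: estimates a worst-case client wait.
public final class RetryBudget {
    private RetryBudget() {}

    /**
     * Rough upper bound on the total wait: the initial attempt plus
     * maxRetries retries, each assumed to hit the full per-attempt timeout,
     * with a fixed delay before every retry.
     */
    public static Duration worstCase(int maxRetries, Duration perAttemptTimeout, Duration retryDelay) {
        if (maxRetries < 0) {
            throw new IllegalArgumentException("maxRetries must not be negative");
        }
        Duration attempts = perAttemptTimeout.multipliedBy(maxRetries + 1L);
        Duration delays = retryDelay.multipliedBy(maxRetries);
        return attempts.plus(delays);
    }

    public static void main(String[] args) {
        // Assumed values for illustration; check your own configuration.
        Duration bound = worstCase(20, Duration.ofSeconds(15), Duration.ofSeconds(3));
        System.out.println("Worst-case wait: " + bound.getSeconds() + " s"); // prints "Worst-case wait: 375 s"
    }
}
```

Under these assumptions the client could block for several minutes before failing, which can look like a hang. Lowering the retry count shrinks the bound roughly linearly.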

You can try reducing the number of retry attempts. I am hesitant to just add a timeout to {{CompletableFuture.get()}}
because it would be orthogonal to the timeouts we already have. If this problem is reproducible,
can you attach client logs and jobmanager logs at debug level?
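
For reference, the option proposed in the issue, a bounded wait on the future, could look roughly like the sketch below. It uses only the JDK; the class {{BoundedWait}} and the method {{getOrFallback}} are hypothetical names, not Flink API, and this is not how {{RestClusterClient}} is implemented.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, not part of Flink: waits on a future with an upper bound.
public final class BoundedWait {
    private BoundedWait() {}

    /**
     * Waits up to timeoutMillis for the future to complete and returns its
     * result, or returns the fallback value if the deadline passes first.
     */
    public static <T> T getOrFallback(CompletableFuture<T> future, long timeoutMillis, T fallback)
            throws InterruptedException, ExecutionException {
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // The wait is abandoned, but the underlying operation is NOT cancelled.
            return fallback;
        }
    }

    public static void main(String[] args) throws Exception {
        // A future that never completes stands in for a stuck submission.
        CompletableFuture<String> stuck = new CompletableFuture<>();
        System.out.println(getOrFallback(stuck, 100, "timed out"));

        // An already completed future returns its value immediately.
        CompletableFuture<String> done = CompletableFuture.completedFuture("submitted");
        System.out.println(getOrFallback(done, 100, "timed out"));
    }
}
```

Note that giving up on the wait does not undo the request: the job may still have been submitted on the server side, so a caller would need to check the job status after a timeout.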

Judging from the stack trace, you are submitting the job in detached mode – is that right?


> RestClusterClient can hang indefinitely during job submission
> -------------------------------------------------------------
>
>                 Key: FLINK-12385
>                 URL: https://issues.apache.org/jira/browse/FLINK-12385
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination, Runtime / REST
>    Affects Versions: 1.8.0
>            Reporter: Matt Dailey
>            Priority: Minor
>
> We have had situations where clients would hang indefinitely during job submission, even
> when job submission would succeed. We have not yet characterized what happened on the server
> to cause this, but we thought that the client should have a timeout for these requests.
> This was observed in Flink 1.5.5, but the code seems to still have this problem in 1.8.0.
> One option is to include a timeout in calls to {{CompletableFuture.get()}}:
>  * [RestClusterClient in 1.5.5|https://github.com/apache/flink/blob/release-1.5.5/flink-clients/src/main/java/org/apache/flink/client/program/rest/RestClusterClient.java#L246]
>  * [RestClusterClient in 1.8.0|https://github.com/apache/flink/blob/release-1.8.0/flink-clients/src/main/java/org/apache/flink/client/program/rest/RestClusterClient.java#L247]
> Thread dump from client running Flink 1.5.5, running in Java 8:
> {noformat}
> "http-nio-0.0.0.0-8443-exec-6" #34 daemon prio=5 os_prio=0 tid=0x000055b421fd2000 nid=0x29 waiting on condition [0x00007f932e176000]
>    java.lang.Thread.State: WAITING (parking)
> 	at sun.misc.Unsafe.park(Native Method)
> 	- parking to wait for  <0x00000000b331d7c0> (a java.util.concurrent.CompletableFuture$Signaller)
> 	at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> 	at java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1693)
> 	at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
> 	at java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1729)
> 	at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895)
> 	at org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:246)
> 	at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:464)
> 	at org.apache.flink.client.program.DetachedEnvironment.finalizeExecute(DetachedEnvironment.java:77)
> 	at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:410)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
