flink-issues mailing list archives

From "Gary Yao (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-10104) Job super slow to start
Date Mon, 13 Aug 2018 06:38:00 GMT

    [ https://issues.apache.org/jira/browse/FLINK-10104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16577887#comment-16577887 ]

Gary Yao commented on FLINK-10104:
----------------------------------

Hi [~fsimond],

I assume you are using Hortonworks HDP 2.5. I was not able to reproduce your
symptoms on their VM. Then I had a deeper look at the logs, in which I see many
occurrences of:
{noformat}
No open TaskExecutor connection <CONTAINER_ID>. Ignoring close TaskExecutor connection.
{noformat}
This is logged in {{ResourceManager#closeTaskManagerConnection}} [1] but
unfortunately we do not log the exception. I suspect that the method is called
from {{YarnResourceManager#onContainersCompleted}} [2]. This method is a callback
invoked by YARN when a container completes. Because there is only a single
TaskManager log in your file (the one that successfully ran the job), I assume
that the containers are stopped for reasons that are outside of Flink's
control (maybe a problem related to your YARN setup).

I would suggest the following things for further troubleshooting: 

* Add improved logging to Flink, and build a custom Flink distribution [3]. For example, log
the {{ContainerStatus}} instances in {{onContainersCompleted}}. The {{ContainerStatus}} has
a diagnostics string that can be helpful. 
* If the improved logging does not help, check YARN logs for hints on why the containers exited.
* Try deploying using the Apache Hadoop distribution.
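
As a rough illustration of the first suggestion, the sketch below shows the kind of log line that could be emitted for each completed container. It assumes the YARN {{ContainerStatus}} accessors {{getContainerId()}}, {{getState()}}, {{getExitStatus()}} and {{getDiagnostics()}}; the {{CompletedContainer}} class and the {{describe}} helper are hypothetical stand-ins so that the example runs without Hadoop on the classpath, and the values in {{main}} are made up. This is not the actual Flink code.

```java
import java.util.Arrays;
import java.util.List;

public class ContainerStatusLogging {

    /** Hypothetical stand-in for YARN's ContainerStatus (not the real class). */
    static final class CompletedContainer {
        final String containerId;  // would come from ContainerStatus#getContainerId()
        final String state;        // would come from ContainerStatus#getState()
        final int exitStatus;      // would come from ContainerStatus#getExitStatus()
        final String diagnostics;  // would come from ContainerStatus#getDiagnostics()

        CompletedContainer(String containerId, String state, int exitStatus, String diagnostics) {
            this.containerId = containerId;
            this.state = state;
            this.exitStatus = exitStatus;
            this.diagnostics = diagnostics;
        }
    }

    /** Builds one log line for a completed container, including the diagnostics string. */
    static String describe(CompletedContainer c) {
        return String.format(
            "Container %s completed: state=%s, exitStatus=%d, diagnostics=%s",
            c.containerId, c.state, c.exitStatus, c.diagnostics);
    }

    public static void main(String[] args) {
        // Made-up example values; in Flink these would be the statuses passed
        // to the onContainersCompleted callback.
        List<CompletedContainer> completed = Arrays.asList(
            new CompletedContainer("container_example_01", "COMPLETE", 1, "example diagnostics"));

        for (CompletedContainer c : completed) {
            // In a custom Flink build, this line would go to the logger instead of stdout.
            System.out.println(describe(c));
        }
    }
}
```

In a custom build, the same fields would be read directly from each {{ContainerStatus}} passed to {{onContainersCompleted}} and written with the existing logger, so the exit status and diagnostics appear next to the "No open TaskExecutor connection" messages.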

Best,
Gary

[1] https://github.com/apache/flink/blob/release-1.5.2/flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/ResourceManager.java#L797

[2] https://github.com/apache/flink/blob/release-1.5.2/flink-yarn/src/main/java/org/apache/flink/yarn/YarnResourceManager.java#L339

[3] https://ci.apache.org/projects/flink/flink-docs-master/start/building.html



> Job super slow to start
> -----------------------
>
>                 Key: FLINK-10104
>                 URL: https://issues.apache.org/jira/browse/FLINK-10104
>             Project: Flink
>          Issue Type: Bug
>    Affects Versions: 1.5.2
>            Reporter: Florian
>            Priority: Major
>         Attachments: flink2.log
>
>
> Following a discussion on another topic with [~GJL] (http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Could-not-build-the-program-from-JAR-file-td22102.html), it seems that there is a bug, as my job is very slow to start.
> I am using Flink to process messages from an input topic and redirect them to two output topics. When I start the job, I have to wait between 5 and 10 minutes before anything arrives in the output topics. With version 1.4.2, it was much faster.
> I run the job on YARN and, as Gary asked, I attached the output of yarn logs -applicationId <appId>.
>  
> Also, as you can see in the logs, the reported version is 0.1 Rev:1a9b648. I have no clue why, as I downloaded the official Flink 1.5.2 distribution.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
