flink-issues mailing list archives

From "Chesnay Schepler (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-10756) TaskManagerProcessFailureBatchRecoveryITCase did not finish on time
Date Wed, 06 Mar 2019 14:20:00 GMT

    [ https://issues.apache.org/jira/browse/FLINK-10756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16785673#comment-16785673 ]

Chesnay Schepler commented on FLINK-10756:

I found a possible explanation for the second issue. The {{AbstractTaskManagerProcessFailureRecoveryTest}}
starts 3 TaskManagers, 2 of which should still be running at the end of the test. These 2
are shut down using {{Process#destroy()}}; however, this method does not guarantee that the
process has actually shut down.
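
One way to close this gap could look like the sketch below: request termination, wait for a bounded time, and escalate to {{Process#destroyForcibly()}} if the process is still alive. The class {{ProcessShutdown}}, the method {{shutdownAndWait}} and the timeout parameter are hypothetical names used for illustration; this is not necessarily the fix that was applied to the test.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical helper, not part of Flink: stop a child process and confirm that it exited. */
final class ProcessShutdown {

    private ProcessShutdown() {}

    /**
     * Requests termination, waits up to timeoutMillis, then escalates to a forcible kill.
     * Returns true only if the process is confirmed to have exited when this method returns.
     */
    static boolean shutdownAndWait(Process process, long timeoutMillis)
            throws InterruptedException {
        // destroy() only requests termination; it may return while the process is still alive.
        process.destroy();
        if (process.waitFor(timeoutMillis, TimeUnit.MILLISECONDS)) {
            return true;
        }
        // The process ignored or delayed the request: kill it forcibly and wait once more.
        process.destroyForcibly();
        return process.waitFor(timeoutMillis, TimeUnit.MILLISECONDS);
    }
}
```

A test teardown could call {{ProcessShutdown.shutdownAndWait(taskManagerProcess, 10_000)}} for each remaining TaskManager process and fail if it returns false, so that no process leaks into the next test.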

There are 2 tests that extend {{AbstractTaskManagerProcessFailureRecoveryTest}}: one for batch and one for streaming.

It could happen that one of the TMs from the first test is still running when the second
test executes. If any task is scheduled to one of these TMs, we can run into a second job failure
(one when this TM finally shuts down, and another caused by the test as expected), but
only 1 is allowed ({{env.setRestartStrategy(RestartStrategies.fixedDelayRestart(1, 0L));}}).
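
The counting argument can be illustrated with a toy model of a restart budget. {{RestartBudget}} is a hypothetical class written for this illustration only; it does not reproduce Flink's restart strategy implementation, it just assumes that a budget of 1 tolerates exactly one failure.

```java
/** Toy model of a fixed restart budget; not Flink's actual implementation. */
final class RestartBudget {

    private final int maxRestarts;
    private int failures;

    RestartBudget(int maxRestarts) {
        this.maxRestarts = maxRestarts;
    }

    /** Records one failure; returns true if the job may restart, false if it fails for good. */
    boolean onFailure() {
        failures++;
        return failures <= maxRestarts;
    }
}
```

With {{new RestartBudget(1)}}, the first call to {{onFailure()}} (the TM killed by the test) returns true, while a second call (the leftover TM shutting down late) returns false, which matches the job failing instead of recovering.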

> TaskManagerProcessFailureBatchRecoveryITCase did not finish on time
> -------------------------------------------------------------------
>                 Key: FLINK-10756
>                 URL: https://issues.apache.org/jira/browse/FLINK-10756
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination, Tests
>    Affects Versions: 1.6.2, 1.7.0
>            Reporter: Bowen Li
>            Assignee: Chesnay Schepler
>            Priority: Critical
>              Labels: test-stability
>             Fix For: 1.8.0
> {code:java}
> Failed tests: 
>   TaskManagerProcessFailureBatchRecoveryITCase>AbstractTaskManagerProcessFailureRecoveryTest.testTaskManagerProcessFailure:207 The program did not finish in time
> {code}
> https://travis-ci.org/apache/flink/jobs/449439623

This message was sent by Atlassian JIRA
