flink-issues mailing list archives

From "Till Rohrmann (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (FLINK-11665) Flink fails to remove JobGraph from ZK even though it reports it did
Date Wed, 06 Mar 2019 16:38:00 GMT

     [ https://issues.apache.org/jira/browse/FLINK-11665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel

Till Rohrmann updated FLINK-11665:
    Affects Version/s: 1.8.0

> Flink fails to remove JobGraph from ZK even though it reports it did
> --------------------------------------------------------------------
>                 Key: FLINK-11665
>                 URL: https://issues.apache.org/jira/browse/FLINK-11665
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.5.5, 1.6.4, 1.7.2, 1.8.0
>            Reporter: Bashar Abdul Jawad
>            Assignee: Andrey Zagrebin
>            Priority: Critical
>              Labels: pull-request-available
>         Attachments: FLINK-11665.csv
>          Time Spent: 10m
>  Remaining Estimate: 0h
> We have recently seen the following issue with Flink 1.5.5:
> Given Flink Job ID 1d24cad26843dcebdfca236d5e3ad82a: 
> 1- A job is activated successfully and the job graph added to ZK:
> {code:java}
> Added SubmittedJobGraph(1d24cad26843dcebdfca236d5e3ad82a, null) to ZooKeeper.
> {code}
> 2- Job is deactivated; Flink reports that the job graph has been successfully removed
> from ZK, and the blob is deleted from the blob server (in this case S3):
> {code:java}
> Removed job graph 1d24cad26843dcebdfca236d5e3ad82a from ZooKeeper.
> {code}
> 3- The JM is later restarted, and Flink for some reason attempts to recover the job that
> it earlier reported it had removed from ZK; since the blob has already been deleted, the
> JM goes into a crash loop. The only way to recover manually is to remove the job graph
> entry from ZK:
> {code:java}
> Recovered SubmittedJobGraph(1d24cad26843dcebdfca236d5e3ad82a, null).
> {code}
> and
> {code:java}
> org.apache.flink.fs.s3presto.shaded.com.amazonaws.services.s3.model.AmazonS3Exception:
> The specified key does not exist. (Service: Amazon S3; Status Code: 404; Error Code: NoSuchKey;
> Request ID: 1BCDFD83FC4546A2), S3 Extended Request ID: OzZtMbihzCm1LKy99s2+rgUMxyll/xYmL6ouMvU2eo30wuDbUmj/DAWoTCs9pNNCLft0FWqbhTo=
> (Path: s3://blam-state-staging/flink/default/blob/job_1d24cad26843dcebdfca236d5e3ad82a/blob_p-c51b25cc0b20351f6e32a628bb6e674ee48a273e-ccfa96b0fd795502897c73714185dde3)
> {code}
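The manual workaround described in step 3 (deleting the stale job graph entry from ZK) can be done programmatically with Apache Curator, the ZooKeeper client Flink itself uses. This is only a hedged sketch: the connection string is a placeholder, and the `/flink/default/jobgraphs/<job-id>` path is an assumption based on the default HA path layout (it depends on your `high-availability.zookeeper.path.*` settings), so verify the actual path first, e.g. with `ls` in `zkCli.sh`.

```java
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class RemoveStaleJobGraph {
    public static void main(String[] args) throws Exception {
        // Connection string is a placeholder for your HA ZooKeeper quorum.
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "zk-host:2181", new ExponentialBackoffRetry(1000, 3));
        client.start();
        try {
            // Assumed default layout: <path.root>/<cluster-id>/jobgraphs/<job-id>.
            // deletingChildrenIfNeeded() also removes any leftover lock children.
            client.delete().deletingChildrenIfNeeded()
                  .forPath("/flink/default/jobgraphs/1d24cad26843dcebdfca236d5e3ad82a");
        } finally {
            client.close();
        }
    }
}
```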
> My question is: under what circumstances would this happen? It seems to happen very
> infrequently, but since the consequence is severe (a JM crash loop), we'd like to
> understand how it can occur.
> This all seems similar to https://issues.apache.org/jira/browse/FLINK-9575, but that
> issue is reported as fixed in Flink 1.5.2 and we are already on Flink 1.5.5.
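One plausible mechanism behind the reported sequence can be sketched as a plain-Java thought experiment (all class and method names below are hypothetical, not Flink's actual internals): if removal releases this JobManager's own lock on the job-graph node and then unconditionally reports success, a leftover lock from another JM keeps the node alive in ZK while the blob is deleted, so a later recovery finds a job graph whose blob is gone.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical simulation of the suspected race: removal is logged as
// successful (and the blob deleted) even though the ZK job-graph node
// survives because another JobManager still holds a lock child node.
public class StaleJobGraphSimulation {

    static class FakeZk {
        // job-graph node -> set of lock owners still holding it
        final Map<String, Set<String>> locks = new HashMap<>();

        void add(String node) { locks.put(node, new HashSet<>()); }
        void lock(String node, String owner) { locks.get(node).add(owner); }

        // Buggy removal: releases this owner's lock, deletes the node only
        // if no other locks remain, but reports success either way.
        boolean buggyRemove(String node, String owner) {
            Set<String> owners = locks.get(node);
            owners.remove(owner);
            if (owners.isEmpty()) {
                locks.remove(node); // actually deleted
            }
            return true; // "Removed job graph ... from ZooKeeper." regardless
        }

        boolean exists(String node) { return locks.containsKey(node); }
    }

    public static void main(String[] args) {
        FakeZk zk = new FakeZk();
        Set<String> blobStore = new HashSet<>();
        String job = "1d24cad26843dcebdfca236d5e3ad82a";

        // Step 1: job submitted; graph node in ZK, blob in the blob store.
        zk.add(job);
        zk.lock(job, "jm-1");
        zk.lock(job, "jm-2"); // stale lock from another JobManager
        blobStore.add(job);

        // Step 2: jm-1 deactivates the job and deletes the blob.
        boolean reportedRemoved = zk.buggyRemove(job, "jm-1");
        blobStore.remove(job);

        // Step 3: removal was reported, yet the node survived; a restarted
        // JM would recover the graph and crash on the missing blob.
        System.out.println("reportedRemoved=" + reportedRemoved);
        System.out.println("nodeStillInZk=" + zk.exists(job));
        System.out.println("blobExists=" + blobStore.contains(job));
    }
}
```

If something like this is the cause, the fix would be for the removal path to check whether the node was actually deleted before reporting success, rather than treating "my lock released" as "graph removed".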

This message was sent by Atlassian JIRA
