flink-issues mailing list archives

From "Hequn Cheng (Jira)" <j...@apache.org>
Subject [jira] [Updated] (FLINK-13940) S3RecoverableWriter causes job to get stuck in recovery
Date Wed, 29 Jan 2020 03:38:00 GMT

     [ https://issues.apache.org/jira/browse/FLINK-13940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hequn Cheng updated FLINK-13940:
--------------------------------
    Fix Version/s:     (was: 1.9.2)

> S3RecoverableWriter causes job to get stuck in recovery
> -------------------------------------------------------
>
>                 Key: FLINK-13940
>                 URL: https://issues.apache.org/jira/browse/FLINK-13940
>             Project: Flink
>          Issue Type: Bug
>          Components: Connectors / FileSystem
>    Affects Versions: 1.8.0, 1.8.1, 1.9.0
>            Reporter: Jimmy Weibel Rasmussen
>            Assignee: Kostas Kloudas
>            Priority: Major
>             Fix For: 1.10.0
>
>
>  
>  The cleanup of tmp files in S3 introduced by this ticket/PR:
>  https://issues.apache.org/jira/browse/FLINK-10963
>  prevents the Flink job from recovering under some circumstances.
>
>  This is what seems to be happening:
>  When the job tries to recover, it calls initializeState() on all operators, which
>  results in the Bucket.restoreInProgressFile method being called.
>  This downloads the part_tmp file mentioned in the checkpoint being restored
>  from, and finally calls fsWriter.cleanupRecoverableState, which deletes the part_tmp
>  file in S3.
>  Next, the open() method is called on all operators. If the open() call fails for one
>  of the operators (which can happen if the issue that caused the job to fail and restart is
>  still unresolved), the job fails again and tries to restart from the same checkpoint as
>  before. This time, however, downloading the part_tmp file mentioned in the checkpoint fails
>  because it was deleted during the previous recovery attempt.
>  The bug is critical because it results in data loss.
>
>  I discovered the bug because I have a Flink job with a RabbitMQ source and a StreamingFileSink
>  that writes to S3 (and therefore uses the S3RecoverableWriter).
>  Occasionally I have RabbitMQ connection issues that cause the job to fail and
>  restart; sometimes the first few restart attempts fail because RabbitMQ is unreachable when
>  Flink tries to reconnect.
>   
>  This is what I was seeing:
>  RabbitMQ goes down.
>  The job fails because of a RabbitMQ ConsumerCancelledException.
>  The job attempts to restart but fails with a RabbitMQ connection exception (x number of times).
>  RabbitMQ comes back up.
>  The job attempts to restart but fails with a FileNotFoundException due to a _part_tmp
>  file missing in S3.
>
>  The job is then unable to restart, and the only option is to cancel and resubmit it (and
>  lose all state).
>
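The failure sequence described in the report can be sketched with a toy model. All class, method, and key names below are hypothetical stand-ins, not Flink's actual Bucket/S3RecoverableWriter code: a Map plays the role of the S3 bucket, and the restore step eagerly deletes the recoverable state, so a second recovery attempt from the same checkpoint finds nothing to download.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified sketch of the bug: the first recovery attempt downloads and then
// deletes the part_tmp object, so a retry from the same checkpoint must fail.
public class RecoveryBugSketch {

    // Stand-in for the S3 bucket: object key -> contents.
    static Map<String, String> s3 = new HashMap<>();

    // Mimics Bucket.restoreInProgressFile: download the tmp object named in
    // the checkpoint, then eagerly clean up the recoverable state, as the
    // cleanup from FLINK-10963 does.
    static String restoreInProgressFile(String tmpKey) {
        String data = s3.get(tmpKey);
        if (data == null) {
            // Corresponds to the FileNotFoundException seen on the retry.
            throw new IllegalStateException("FileNotFoundException: " + tmpKey);
        }
        cleanupRecoverableState(tmpKey); // deletes the tmp object in S3
        return data;
    }

    static void cleanupRecoverableState(String tmpKey) {
        s3.remove(tmpKey);
    }

    public static void main(String[] args) {
        s3.put("part_tmp-0", "buffered-records");

        // First recovery attempt: initializeState() succeeds ...
        System.out.println("first attempt restored: " + restoreInProgressFile("part_tmp-0"));

        // ... but open() then fails (e.g. RabbitMQ still unreachable), so the
        // job restarts from the SAME checkpoint. The tmp object is now gone:
        try {
            restoreInProgressFile("part_tmp-0");
        } catch (IllegalStateException e) {
            System.out.println("second attempt: " + e.getMessage());
        }
    }
}
```

Running main prints the successful first restore followed by the failure of the second attempt, mirroring the restart loop seen in the job logs.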



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
