flink-user mailing list archives

From Szymon Szczypiński <simo...@poczta.fm>
Subject Re: DFS problem with removing checkpoint
Date Sun, 22 Apr 2018 15:22:27 GMT
Hi,

the problem started on 1.3.1. I have now upgraded to Flink 1.3.3; I 
changed my cluster to 1.3.3 because of JIRA issue 
https://issues.apache.org/jira/browse/FLINK-8807.

I will check in debug mode why the cluster doesn't remove those files; 
maybe I will see why.
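
For reference, this is how I plan to raise the log level for the 
checkpoint classes in conf/log4j.properties (a minimal sketch, assuming 
the default log4j setup that ships with Flink; package names taken from 
the classes mentioned further down this thread):

    # trace the checkpoint bookkeeping (includes the ZooKeeper store)
    log4j.logger.org.apache.flink.runtime.checkpoint=DEBUG
    # the state handles do the actual file deletion, so watch them too
    log4j.logger.org.apache.flink.runtime.state.filesystem=DEBUG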

Best regards

On 22.04.2018 16:59, Stephan Ewen wrote:
> Hi!
>
> Sorry for the late response... Which Flink version are you on?
>
> I am wondering if this is somewhat related to that specific setup: a 
> Windows DFS filesystem mounted on Linux with CIFS.
>
>   - For the "completedCheckpoint<SomeID>" files, the cleanup should 
>     happen in the "ZooKeeperCompletedCheckpointStore" when a 
>     checkpoint is dropped.
>   - For the "state.backend.fs.checkpointdir/JobId/check-<INTEGER>" 
>     directory, it should (in Flink 1.3 and 1.4) be the 
>     FileStateHandle that deletes the parent directory when it is 
>     empty, meaning the last state chunk to be deleted also deletes 
>     the parent directory. In Flink 1.5, it is the disposal call of 
>     the CheckpointStorageLocation. (A simplified sketch of that 
>     delete-parent-when-empty idea follows below.)
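>
> In code, that delete-parent-when-empty behavior is roughly the 
> following (a simplified sketch in plain java.nio, not the actual 
> FileStateHandle implementation; the names are illustrative):
>
>     import java.io.IOException;
>     import java.nio.file.*;
>
>     public class CheckpointCleanupSketch {
>
>         /** Deletes one state file, then drops the parent (checkpoint)
>          *  directory if that was the last file in it. */
>         static void discardStateFile(Path stateFile) throws IOException {
>             Files.deleteIfExists(stateFile);
>
>             Path parent = stateFile.getParent();
>             if (parent == null) {
>                 return; // nothing above us to clean up
>             }
>
>             // check whether the directory is empty now
>             boolean empty;
>             try (DirectoryStream<Path> rest = Files.newDirectoryStream(parent)) {
>                 empty = !rest.iterator().hasNext();
>             }
>
>             if (empty) {
>                 try {
>                     Files.delete(parent); // last chunk gone -> dir goes too
>                 } catch (DirectoryNotEmptyException | NoSuchFileException e) {
>                     // a concurrent discard raced us; nothing left to do
>                 }
>             }
>         }
>     }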
>
> Best,
> Stephan
>
> On Sat, Apr 7, 2018 at 1:11 AM, Szymon Szczypiński
> <simon_t@poczta.fm> wrote:
>
>     Hi,
>
>     in my case neither gets deleted. In high-availability.storageDir
>     the number of files of the type "completedCheckpoint<SomeID>"
>     keeps growing, and so does the number of dirs in
>     "state.backend.fs.checkpointdir/JobId/check-<INTEGER>".
>
>     In my case I have a Windows DFS filesystem mounted on Linux with
>     the CIFS protocol.
>
>     Can you give me a hint or a description of which process is
>     responsible for removing those files and directories?
>
>     Best regards
>
>
>     On 2018-04-02 at 15:58, Stephan Ewen wrote:
>>     Can you clarify which one does not get deleted? The file in the
>>     "high-availability.storageDir", the
>>     "state.backend.fs.checkpointdir/JobId/check-<INTEGER>" directory,
>>     or both?
>>
>>     Could you also tell us which file system you use?
>>
>>     There is a known issue in some versions of Flink where S3
>>     "directories" are not deleted. This means that Hadoop's S3 marker
>>     files (the way Hadoop's s3n and s3a imitate directories in S3)
>>     are not deleted. This is fixed in Flink 1.5. A workaround for
>>     Flink 1.4 is to use "flink-s3-fs-presto", which does not use
>>     these marker files; a minimal setup sketch follows below.
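>>
>>     Switching to it is mostly configuration (a minimal sketch,
>>     assuming Flink 1.4's distribution layout; adjust the jar version
>>     and the bucket name to your setup):
>>
>>         # make the presto-based S3 filesystem available to Flink
>>         cp ./opt/flink-s3-fs-presto-1.4.2.jar ./lib/
>>
>>         # flink-conf.yaml: point checkpoints at an s3:// URI, which
>>         # the presto filesystem serves without marker files
>>         state.backend: filesystem
>>         state.backend.fs.checkpointdir: s3://my-bucket/checkpoints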
>>
>>
>>     On Thu, Mar 29, 2018 at 8:49 PM, Szymon Szczypiński
>>     <simon_t@poczta.fm> wrote:
>>
>>         Thank you for your reply.
>>
>>         But my problem is not that Flink doesn't remove
>>         files/directories after an incorrect job cancellation. My
>>         problem is different.
>>         Precisely: when everything is OK and the job is in RUNNING
>>         state, each completed checkpoint creates a file named
>>         completedCheckpoint<SomeID> in "high-availability.storageDir"
>>         and a directory at
>>         "state.backend.fs.checkpointdir/JobId/check-<INTEGER>". When
>>         the next checkpoint completes, the previous file and dir are
>>         deleted, and this is fine, because I always have only one
>>         checkpoint.
>>         But in my case, when the next checkpoint completes, the
>>         previous one is not deleted, and this happens while the job
>>         is in RUNNING state.
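>>
>>         (The "only one checkpoint" behaviour matches the retained
>>         checkpoint count in flink-conf.yaml; a minimal sketch,
>>         assuming the default value:
>>
>>             # keep only the most recent completed checkpoint
>>             state.checkpoints.num-retained: 1
>>
>>         so a healthy cluster should hold exactly one
>>         completedCheckpoint<SomeID> file at a time.)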
>>
>>         Maybe you know why those files/dirs are not deleted.
>>
>>         Best regards
>>
>>         Szymon Szczypiński
>>
>>
>>         On 29.03.2018 11:23, Stephan Ewen wrote:
>>>         Flink removes these files/directories only if you properly
>>>         cancel the job. If you just kill the processes
>>>         (stop-cluster.sh), this looks like a failure, and Flink
>>>         will try to recover. The recovery always starts from
>>>         ZooKeeper, not from the DFS.
>>>
>>>         Best way to prevent this is to (see the sketch right after
>>>         this list):
>>>           - properly cancel jobs, not just kill processes
>>>           - use separate cluster IDs for the standalone clusters, so
>>>             that the new cluster knows that it is not supposed to
>>>             recover the previous jobs and checkpoints
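>>>
>>>         Both steps as commands/config (a minimal sketch; the
>>>         cluster ID value is only an example):
>>>
>>>             # cancel cleanly instead of killing the processes
>>>             bin/flink cancel <jobID>
>>>
>>>             # flink-conf.yaml: give each standalone cluster its own
>>>             # ZooKeeper namespace, so a new cluster never tries to
>>>             # recover another cluster's jobs and checkpoints
>>>             high-availability.cluster-id: /cluster_one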
>>>
>>>
>>>
>>>         On Wed, Mar 28, 2018 at 11:22 PM, Szymon Szczypiński
>>>         <simon_t@poczta.fm> wrote:
>>>
>>>             Hi,
>>>
>>>             I have a problem with Flink version 1.3.1.
>>>
>>>             I have a standalone cluster with two JobManagers and
>>>             four TaskManagers; as the DFS I use Windows highly
>>>             available storage mounted via the CIFS protocol.
>>>
>>>             Sometimes I start having the problem that Flink doesn't
>>>             remove the checkpoint dirs for a job, nor the
>>>             completedCheckpoint files from
>>>             "high-availability.storageDir".
>>>
>>>             To bring the cluster back to normal operation I need to
>>>             remove all the dirs from the DFS and start everything
>>>             from the beginning.
>>>
>>>
>>>             Maybe some Flink users have had the same problem. For
>>>             now I don't have any idea how to bring the cluster back
>>>             to normal operation without deleting the dirs from the
>>>             DFS.
>>>
>>>             I don't want to delete the dirs from the DFS because
>>>             then I would need to redeploy all the jobs.
>>>
>>>
>>>             Best regards
>>>
>>>             Szymon Szczypiński
>>>
>>>
>>
>>
>
>

