flink-user mailing list archives

From Szymon Szczypiński <simo...@poczta.fm>
Subject Re: DFS problem with removing checkpoint
Date Fri, 06 Apr 2018 23:11:53 GMT
Hi,

in my case neither gets deleted. In "high-availability.storageDir" the 
number of "completedCheckpoint<SomeID>" files keeps growing, and so does 
the number of directories under 
"state.backend.fs.checkpointdir/JobId/check-<INTEGER>".

In my case I have a Windows DFS filesystem mounted on Linux via the CIFS 
protocol.
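
For reference, the relevant entries in my flink-conf.yaml point at that 
mount and look roughly like this (the paths below are only placeholders 
for the CIFS-mounted share):

    # HA metadata; the completedCheckpoint<SomeID> files end up here
    high-availability.storageDir: file:///mnt/flink-dfs/ha

    # checkpoint data; the JobId/check-<INTEGER> dirs end up here
    state.backend: filesystem
    state.backend.fs.checkpointdir: file:///mnt/flink-dfs/checkpoints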

Can you give me a hint or a description of which process is responsible 
for removing those files and directories?

Best regards


On 2018-04-02 at 15:58, Stephan Ewen wrote:
> Can you clarify which one does not get deleted? The file in the 
> "high-availability.storageDir", or the 
> "state.backend.fs.checkpointdir/JobId/check-<INTEGER>", or both?
>
> Could you also tell us which file system you use?
>
> There is a known issue in some versions of Flink where S3 "directories" 
> are not deleted. This means that Hadoop's S3 marker files (the way 
> that Hadoop's s3n and s3a imitate directories in S3) are not deleted. 
> This is fixed in Flink 1.5. A workaround for Flink 1.4 is to use 
> "flink-s3-fs-presto", which does not use these marker files.
>
>
> On Thu, Mar 29, 2018 at 8:49 PM, Szymon Szczypiński <simon_t@poczta.fm> 
> wrote:
>
>     Thank you for your reply.
>
>     But my problem is not that Flink doesn't remove files/directories
>     after an incorrect job cancellation. My problem is different.
>     To be precise: when everything is OK and the job is in the RUNNING
>     state, each completed checkpoint creates a file named
>     completedCheckpoint<SomeID> in "high-availability.storageDir" and a
>     directory under "state.backend.fs.checkpointdir/JobId/check-<INTEGER>".
>     After the next checkpoint completes, the previous file and directory
>     are deleted, which is fine, because I always keep only one checkpoint.
>     But in my case, when the next checkpoint completes, the previous one
>     is not deleted, and this happens while the job is in the RUNNING state.
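>
>     (For what it's worth, I rely on the default retained-checkpoints
>     setting; if I read the config docs correctly that corresponds to the
>     following in flink-conf.yaml:
>
>         state.checkpoints.num-retained: 1
>
>     so only the latest completed checkpoint should be kept.)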
>
>     Maybe you know why those files/dirs are not deleted.
>
>     Best regards
>
>     Szymon Szczypiński
>
>
>     On 29.03.2018 11:23, Stephan Ewen wrote:
>>     Flink removes these files / directories only if you properly
>>     cancel the job. If you kill the processes (stop-cluster.sh), this
>>     looks like a failure, and Flink will try to recover. The recovery
>>     always starts in ZooKeeper, not in the DFS.
>>
>>     The best way to prevent this is to
>>       - properly cancel jobs, not just kill processes
>>       - use separate cluster IDs for the standalone clusters, so that
>>     the new cluster knows that it is not supposed to recover the
>>     previous jobs and checkpoints (see the config sketch below)
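>>
>>     A minimal flink-conf.yaml sketch for the second point (the ZooKeeper
>>     quorum, storage path, and cluster ID below are placeholders):
>>
>>         high-availability: zookeeper
>>         high-availability.zookeeper.quorum: zk1:2181,zk2:2181,zk3:2181
>>         high-availability.storageDir: file:///mnt/flink-dfs/ha
>>         # give every standalone cluster its own ID so a freshly started
>>         # cluster does not try to recover another cluster's jobs
>>         high-availability.cluster-id: /cluster_one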
>>
>>
>>
>>     On Wed, Mar 28, 2018 at 11:22 PM, Szymon Szczypiński
>>     <simon_t@poczta.fm> wrote:
>>
>>         Hi,
>>
>>         I have a problem with Flink version 1.3.1.
>>
>>         I have a standalone cluster with two JobManagers and four
>>         TaskManagers; as DFS I use a Windows highly available storage
>>         mounted via the CIFS protocol.
>>
>>         Sometimes I start having the problem that Flink doesn't remove
>>         the checkpoint dirs for a job and the completedCheckpoint files
>>         from "high-availability.storageDir".
>>
>>         To bring the cluster back to normal operation I need to remove
>>         all dirs from the DFS and start everything from the beginning.
>>
>>
>>         Maybe some other Flink users have had the same problem. For now
>>         I have no idea how to bring the cluster back to normal operation
>>         without deleting the dirs from the DFS.
>>
>>         I don't want to delete the dirs from the DFS because then I
>>         would need to redeploy all jobs.
>>
>>
>>         Best regards
>>
>>         Szymon Szczypiński
>>
>>
>
>

