flink-issues mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-5940) ZooKeeperCompletedCheckpointStore cannot handle broken state handles
Date Fri, 10 Mar 2017 15:29:04 GMT

    [ https://issues.apache.org/jira/browse/FLINK-5940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15905258#comment-15905258 ]

ASF GitHub Bot commented on FLINK-5940:
---------------------------------------

Github user tillrohrmann commented on a diff in the pull request:

    https://github.com/apache/flink/pull/3446#discussion_r105419062
  
    --- Diff: flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/ZooKeeperCompletedCheckpointStore.java ---
    @@ -226,16 +200,43 @@ public CompletedCheckpoint getLatestCheckpoint() throws Exception {
     			return null;
     		}
     		else {
    -			return checkpointStateHandles.getLast().f0.retrieveState();
    +			while(!checkpointStateHandles.isEmpty()) {
    +				Tuple2<RetrievableStateHandle<CompletedCheckpoint>, String> checkpointStateHandle = checkpointStateHandles.peekLast();
    +
    +				try {
    +					return retrieveCompletedCheckpoint(checkpointStateHandle);
    +				} catch (FlinkException e) {
    --- End diff --
    
    Technically, I think it was OK, because the `retrieveCompletedCheckpoint` method catches
    all `Exception`s and wraps them in a `FlinkException`. But it's better not to rely on this
    implementation detail.


> ZooKeeperCompletedCheckpointStore cannot handle broken state handles
> --------------------------------------------------------------------
>
>                 Key: FLINK-5940
>                 URL: https://issues.apache.org/jira/browse/FLINK-5940
>             Project: Flink
>          Issue Type: Bug
>          Components: State Backends, Checkpointing
>    Affects Versions: 1.2.0, 1.1.4, 1.3.0
>            Reporter: Till Rohrmann
>            Assignee: Till Rohrmann
>
> The {{ZooKeeperCompletedCheckpointStore}} reads a set of {{RetrievableStateHandles}}
> from ZooKeeper upon recovery. It then tries to retrieve the {{CompletedCheckpoint}} from the
> latest state handle. If the retrieve operation fails, then the whole recovery of completed
> checkpoints fails even though the store might have read older state handles from ZooKeeper.
>
> I propose to harden the behaviour by removing broken state handles and returning the
> first successfully retrieved {{CompletedCheckpoint}}.
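
The proposed fallback can be illustrated with a minimal, self-contained sketch. It does not use Flink's classes and is not the code from the pull request: `CheckpointFallbackSketch` and `getLatestRetrievable` are hypothetical names, and a plain `Callable` stands in for a retrievable state handle. It only drops broken handles from an in-memory deque; whether and how the real store also cleans up the corresponding ZooKeeper entries is not shown here.

```java
import java.io.IOException;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.Callable;

public class CheckpointFallbackSketch {

    /**
     * Tries the newest handle first. A handle whose retrieval throws is
     * removed and the next older one is tried. Returns null when no handle
     * can be retrieved.
     */
    public static <T> T getLatestRetrievable(Deque<Callable<T>> handles) {
        while (!handles.isEmpty()) {
            Callable<T> handle = handles.peekLast();
            try {
                return handle.call();
            } catch (Exception e) {
                // Catch Exception broadly instead of relying on the callee
                // wrapping every failure in one specific exception type.
                handles.removeLast();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Deque<Callable<String>> handles = new ArrayDeque<>();
        handles.addLast(() -> "checkpoint-1");
        handles.addLast(() -> "checkpoint-2");
        // The newest handle is broken, e.g. its state file is gone.
        handles.addLast(() -> { throw new IOException("state file missing"); });

        System.out.println(getLatestRetrievable(handles)); // prints "checkpoint-2"
        System.out.println(handles.size());                // prints "2"
    }
}
```

The successfully retrieved handle stays in the deque (it is only peeked), so later calls return the same checkpoint without retrying the broken one.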



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
