flink-issues mailing list archives

From "Stefan Richter (JIRA)" <j...@apache.org>
Subject [jira] [Closed] (FLINK-7268) Zookeeper Checkpoint Store interacting with Incremental State Handles can lead to loss of handles
Date Tue, 15 Aug 2017 13:05:00 GMT

     [ https://issues.apache.org/jira/browse/FLINK-7268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stefan Richter closed FLINK-7268.
---------------------------------
    Resolution: Fixed

Merged in 91a4b27617 (1.4) and 09caa9ffdc (1.3)

> Zookeeper Checkpoint Store interacting with Incremental State Handles can lead to loss of handles
> -------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-7268
>                 URL: https://issues.apache.org/jira/browse/FLINK-7268
>             Project: Flink
>          Issue Type: Bug
>          Components: State Backends, Checkpointing
>    Affects Versions: 1.3.0, 1.3.1, 1.4.0
>            Reporter: Aljoscha Krettek
>            Assignee: Stefan Richter
>            Priority: Blocker
>             Fix For: 1.4.0, 1.3.2
>
>         Attachments: gce_rocks_incr_external_gs-more-logs.txt
>
>
> Release testing for Flink 1.3.2 has shown that this combination of features leads to the following error when using a very low restart delay:
> {code}
> java.lang.IllegalStateException: Could not initialize keyed state backend.
> 	at org.apache.flink.streaming.api.operators.AbstractStreamOperator.initKeyedState(AbstractStreamOperator.java:321)
> 	at org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:217)
> 	at org.apache.flink.streaming.runtime.tasks.StreamTask.initializeOperators(StreamTask.java:676)
> 	at org.apache.flink.streaming.runtime.tasks.StreamTask.initializeState(StreamTask.java:663)
> 	at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:252)
> 	at org.apache.flink.runtime.taskmanager.Task.run(Task.java:702)
> 	at java.lang.Thread.run(Thread.java:748)
> Caused by: java.io.FileNotFoundException: Item not found: aljoscha/state-machine-checkpoints-2/f26e2b4c6891f2a9e0c5e4ba014733c3/chk-3/b246db8c-4f25-483a-b1fc-234f4319004d
> 	at com.google.cloud.hadoop.gcsio.GoogleCloudStorageExceptions.getFileNotFoundException(GoogleCloudStorageExceptions.java:42)
> 	at com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.open(GoogleCloudStorageImpl.java:551)
> 	at com.google.cloud.hadoop.gcsio.GoogleCloudStorageFileSystem.open(GoogleCloudStorageFileSystem.java:322)
> 	at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFSInputStream.<init>(GoogleHadoopFSInputStream.java:121)
> 	at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.open(GoogleHadoopFileSystemBase.java:1076)
> 	at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:767)
> 	at org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.open(HadoopFileSystem.java:404)
> 	at org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.open(HadoopFileSystem.java:48)
> 	at org.apache.flink.core.fs.SafetyNetWrapperFileSystem.open(SafetyNetWrapperFileSystem.java:85)
> 	at org.apache.flink.runtime.state.filesystem.FileStateHandle.openInputStream(FileStateHandle.java:69)
> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.readStateData(RocksDBKeyedStateBackend.java:1281)
> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.readAllStateData(RocksDBKeyedStateBackend.java:1468)
> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restoreInstance(RocksDBKeyedStateBackend.java:1324)
> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restore(RocksDBKeyedStateBackend.java:1503)
> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend.restore(RocksDBKeyedStateBackend.java:970)
> 	at org.apache.flink.streaming.runtime.tasks.StreamTask.createKeyedStateBackend(StreamTask.java:772)
> 	at org.apache.flink.streaming.api.operators.AbstractStreamOperator.initKeyedState(AbstractStreamOperator.java:311)
> 	... 6 more
> {code}
> When this occurs, the job is stuck in a restart loop. The problem (according to [~srichter]) seems to be that removal of pending checkpoints from ZooKeeper happens asynchronously, and those requests can go through when the job has already restarted.
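
The ordering hazard described in the issue can be illustrated with a small, deterministic toy model. This is only a sketch and not Flink's code or the actual fix: the class `AsyncRemovalRaceSketch`, the method `fileSurvivesRestart`, the path `chk-3/state-file`, and the assumption that the restarted job writes to the same path as the stale removal request are all hypothetical. A plain queue of `Runnable`s stands in for asynchronous removal requests so that the interleaving is reproducible.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

public class AsyncRemovalRaceSketch {

    // Hypothetical path; stands in for a checkpoint file in external storage.
    static final String PATH = "chk-3/state-file";

    /**
     * Simulates one failure/restart cycle and reports whether the file the
     * restarted job depends on still exists afterwards.
     *
     * @param drainBeforeRestart if true, queued removals complete before the
     *                           restart; if false, they run only after it
     */
    static boolean fileSurvivesRestart(boolean drainBeforeRestart) {
        Set<String> storage = new HashSet<>();            // toy external storage
        Queue<Runnable> removalQueue = new ArrayDeque<>(); // toy async removal requests

        // First run: a pending checkpoint writes a file and is then abandoned.
        // Its removal is only queued, not executed.
        storage.add(PATH);
        removalQueue.add(() -> storage.remove(PATH));

        if (drainBeforeRestart) {
            drain(removalQueue); // removal completes while the old run still owns the path
        }

        // Restart: the new run writes state under the same (assumed) path
        // and will need it for a later restore.
        storage.add(PATH);

        // Any removal request that was still outstanding goes through now.
        drain(removalQueue);

        return storage.contains(PATH);
    }

    /** Runs and discards every queued removal request. */
    static void drain(Queue<Runnable> queue) {
        Runnable request;
        while ((request = queue.poll()) != null) {
            request.run();
        }
    }

    public static void main(String[] args) {
        System.out.println("removals drained before restart: " + fileSurvivesRestart(true));
        System.out.println("removals drained after restart:  " + fileSurvivesRestart(false));
    }
}
```

In this model the file survives only when the queued removal completes before the restart. When the stale request runs afterwards, it deletes a file the new run depends on, which is the kind of missing-file failure a later restore would then hit.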



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
