flink-issues mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-4218) Sporadic "java.lang.RuntimeException: Error triggering a checkpoint..." causes task restarting
Date Fri, 23 Sep 2016 20:23:20 GMT

    [ https://issues.apache.org/jira/browse/FLINK-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15517476#comment-15517476 ]

ASF GitHub Bot commented on FLINK-4218:
---------------------------------------

Github user StefanRRichter commented on a diff in the pull request:

    https://github.com/apache/flink/pull/2544#discussion_r80318790
  
    --- Diff: flink-runtime/src/main/java/org/apache/flink/runtime/state/filesystem/FsCheckpointStreamFactory.java ---
    @@ -301,9 +301,16 @@ public StreamStateHandle closeAndGetHandle() throws IOException {
     					}
     					else {
     						flush();
    +
    +						long size = -1;
    --- End diff --
    
    I am not sure that returning -1 as the size on an exception is ideal. Currently, this value should
only be used in the calculation of metadata, but one might be tempted to use it, e.g., to preallocate
a byte[] to read the file into, so this should at least be documented in `StateObject`. Furthermore,
we assume that the stream position equals the final file size. I am not
entirely sure this holds for all streams and file systems, but I guess it is the best
we can do without asking the file system for metadata.
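    To illustrate the concern about a -1 size, here is a minimal sketch of how a reader
could treat the reported size purely as a hint. The class `StateSizeSketch`, the constant
`UNKNOWN_SIZE`, and both helper methods are hypothetical names made up for this example; they
are not part of Flink or of the pull request.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

// Hypothetical helper, not a Flink class: treats a reported state size as a hint only.
final class StateSizeSketch {
    // Assumed sentinel meaning "size could not be determined when the stream was closed".
    static final long UNKNOWN_SIZE = -1L;

    private static final int DEFAULT_CAPACITY = 8 * 1024;
    // Upper bound on the hint so that a wrong size cannot force a huge allocation.
    private static final int MAX_HINT = 1 << 20;

    // Maps a reported size to a safe initial buffer capacity.
    static int bufferCapacityFor(long reportedSize) {
        if (reportedSize < 0) {
            // Negative sizes (including the sentinel) must never reach an allocation.
            return DEFAULT_CAPACITY;
        }
        return (int) Math.min(reportedSize, MAX_HINT);
    }

    // Reads the whole stream; the result does not depend on the reported size being correct.
    static byte[] readAll(InputStream in, long reportedSize) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream(bufferCapacityFor(reportedSize));
        byte[] chunk = new byte[4096];
        try {
            int read;
            while ((read = in.read(chunk)) != -1) {
                buffer.write(chunk, 0, read);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }
}
```

    A caller that instead wrote `new byte[(int) size]` would fail with a
`NegativeArraySizeException` when the size is -1, which is the kind of misuse the documentation
should warn about.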


> Sporadic "java.lang.RuntimeException: Error triggering a checkpoint..." causes task restarting
> ----------------------------------------------------------------------------------------------
>
>                 Key: FLINK-4218
>                 URL: https://issues.apache.org/jira/browse/FLINK-4218
>             Project: Flink
>          Issue Type: Improvement
>    Affects Versions: 1.1.0
>            Reporter: Sergii Koshel
>
> An exception like the one below occurs sporadically and causes the task to restart.
> {code:title=Exception|borderStyle=solid}
> java.lang.RuntimeException: Error triggering a checkpoint as the result of receiving checkpoint barrier
> 	at org.apache.flink.streaming.runtime.tasks.StreamTask$3.onEvent(StreamTask.java:785)
> 	at org.apache.flink.streaming.runtime.tasks.StreamTask$3.onEvent(StreamTask.java:775)
> 	at org.apache.flink.streaming.runtime.io.BarrierBuffer.processBarrier(BarrierBuffer.java:203)
> 	at org.apache.flink.streaming.runtime.io.BarrierBuffer.getNextNonBlocked(BarrierBuffer.java:129)
> 	at org.apache.flink.streaming.runtime.io.StreamInputProcessor.processInput(StreamInputProcessor.java:183)
> 	at org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.run(OneInputStreamTask.java:66)
> 	at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:265)
> 	at org.apache.flink.runtime.taskmanager.Task.run(Task.java:588)
> 	at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.FileNotFoundException: No such file or directory: s3://<bucket_name_here>/flink/checkpoints/ece317c26960464ba5de75f3bbc38cb2/chk-8810/96eebbeb-de14-45c7-8ebb-e7cde978d6d3
> 	at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:996)
> 	at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:77)
> 	at org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.getFileStatus(HadoopFileSystem.java:351)
> 	at org.apache.flink.runtime.state.filesystem.AbstractFileStateHandle.getFileSize(AbstractFileStateHandle.java:93)
> 	at org.apache.flink.runtime.state.filesystem.FileStreamStateHandle.getStateSize(FileStreamStateHandle.java:58)
> 	at org.apache.flink.runtime.state.AbstractStateBackend$DataInputViewHandle.getStateSize(AbstractStateBackend.java:482)
> 	at org.apache.flink.streaming.runtime.tasks.StreamTaskStateList.getStateSize(StreamTaskStateList.java:77)
> 	at org.apache.flink.streaming.runtime.tasks.StreamTask.performCheckpoint(StreamTask.java:604)
> 	at org.apache.flink.streaming.runtime.tasks.StreamTask$3.onEvent(StreamTask.java:779)
> 	... 8 more
> {code}
> The file actually exists on S3.
> I suppose it is related to a race condition with S3, but it would be good to retry a
> few times before stopping task execution.
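
The retry suggested in the issue description could look roughly like the sketch below. It
is only an illustration of a bounded retry on `FileNotFoundException`; `RetrySketch`, `IoCall`,
and `retryOnNotFound` are hypothetical names, and this is not how Flink handles the problem.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical helper, not a Flink class: retries a lookup that may fail with
// FileNotFoundException while a freshly written object is not yet visible.
final class RetrySketch {
    @FunctionalInterface
    interface IoCall<T> {
        T call() throws IOException;
    }

    static <T> T retryOnNotFound(IoCall<T> call, int maxAttempts, long backoffMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        FileNotFoundException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return call.call();
            } catch (FileNotFoundException e) {
                // Possibly a visibility delay: remember the failure and try again.
                last = e;
                if (attempt < maxAttempts) {
                    sleep(backoffMillis * attempt); // linear backoff between attempts
                }
            } catch (IOException e) {
                // Any other I/O failure is not retried.
                throw new UncheckedIOException(e);
            }
        }
        throw new UncheckedIOException(last);
    }

    private static void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            throw new UncheckedIOException(new IOException("interrupted while retrying", ie));
        }
    }
}
```

A caller would wrap the file status lookup in `retryOnNotFound(...)`. Retrying only hides
the delay, so avoiding the lookup altogether, as discussed in the review comment above, may be
the more robust option.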



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
