flink-issues mailing list archives

From "Gyula Fora (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-4193) Task manager JVM crashes while deploying cancelling jobs
Date Tue, 01 Nov 2016 08:33:58 GMT

    [ https://issues.apache.org/jira/browse/FLINK-4193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15624754#comment-15624754 ]

Gyula Fora commented on FLINK-4193:
-----------------------------------

These issues usually happened inside the RocksDB.open(...) method, during initialization of
the state backend. If you think the refactoring could affect this, then we might get lucky :)

We are running this in production applications and haven't ported them to 1.2 yet, but I
will start working on that in a week or two.

> Task manager JVM crashes while deploying cancelling jobs
> --------------------------------------------------------
>
>                 Key: FLINK-4193
>                 URL: https://issues.apache.org/jira/browse/FLINK-4193
>             Project: Flink
>          Issue Type: Bug
>          Components: Streaming, TaskManager
>            Reporter: Gyula Fora
>            Priority: Critical
>
> We have observed several TM crashes while deploying larger stateful streaming jobs that use the RocksDB state backend.
> As the JVMs crash the logs don't show anything but I have uploaded all the info I have got from the standard output.
> This indicates some GC and possibly some RocksDB issues underneath but we could not really figure out much more.
> GC segfault
> https://gist.github.com/gyfora/9e56d4a0d4fc285a8d838e1b281ae125
> Other crashes (maybe rocks related)
> https://gist.github.com/gyfora/525c67c747873f0ff2ff2ed1682efefa
> https://gist.github.com/gyfora/b93611fde87b1f2516eeaf6bfbe8d818
> The third link shows 2 issues that happened in parallel...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
