flink-issues mailing list archives

From "Till Rohrmann (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-10941) Slots prematurely released which still contain unconsumed data
Date Wed, 17 Apr 2019 13:30:00 GMT

    [ https://issues.apache.org/jira/browse/FLINK-10941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16820087#comment-16820087 ]

Till Rohrmann commented on FLINK-10941:

How much effort would it be? Technically we still support {{1.7.x}}, and I know of
users who are running into this problem. If it's too much work and we swiftly release {{1.8.1}},
we could ask these users whether they could upgrade to that version as a compromise.

> Slots prematurely released which still contain unconsumed data 
> ---------------------------------------------------------------
>                 Key: FLINK-10941
>                 URL: https://issues.apache.org/jira/browse/FLINK-10941
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.5.5, 1.6.2, 1.7.0
>            Reporter: Qi
>            Assignee: Andrey Zagrebin
>            Priority: Critical
>              Labels: pull-request-available
>             Fix For: 1.9.0, 1.8.1
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
> Our case is: Flink 1.5 in batch mode, with parallelism 32 to read the data source and parallelism 4 to write the data sink.
> The read task worked perfectly with 32 TMs. However, when the job was executing the write task, only 4 TMs were needed, so the other 28 TMs were released. This caused a RemoteTransportException in the write task:
> org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException: Connection unexpectedly closed by remote task manager 'the_previous_TM_used_by_read_task'. This might indicate that the remote task manager was lost.
> 	at org.apache.flink.runtime.io.network.netty.PartitionRequestClientHandler.channelInactive(PartitionRequestClientHandler.java:133)
> 	at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:237)
> 	...
> After skimming the YarnFlinkResourceManager-related code, it seems to me that Flink releases TMs when they’re idle, regardless of whether working TMs still need them.
> Put another way, Flink seems to prematurely release slots that contain unconsumed data and thus eventually releases a TM, which then fails a consuming task.
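
The behavior reported above can be pictured with a small, purely illustrative Java sketch. It is not Flink code: the class `IdleReleaseSketch`, the type `TaskManagerState`, and both check methods are hypothetical names invented here, and the message does not say how the actual fix is implemented. The sketch only contrasts an idle check that looks at running tasks alone with one that also requires all produced data to have been consumed.

```java
/**
 * Hypothetical model of the idle-release decision described in the report.
 * None of these names come from Flink's ResourceManager or SlotManager.
 */
public class IdleReleaseSketch {

    /** Minimal stand-in for what a resource manager might know about a TM. */
    static final class TaskManagerState {
        final int runningTasks;          // tasks currently executing on the TM
        final int unconsumedPartitions;  // produced results not yet fully read

        TaskManagerState(int runningTasks, int unconsumedPartitions) {
            this.runningTasks = runningTasks;
            this.unconsumedPartitions = unconsumedPartitions;
        }
    }

    /** Reported behavior (as assumed here): idle means "no running tasks". */
    static boolean releasableIdleOnly(TaskManagerState tm) {
        return tm.runningTasks == 0;
    }

    /** Safer check: also require that no produced data is still unconsumed. */
    static boolean releasableWhenDrained(TaskManagerState tm) {
        return tm.runningTasks == 0 && tm.unconsumedPartitions == 0;
    }

    public static void main(String[] args) {
        // A read-side TM whose task has finished but whose output is still
        // being fetched by one of the 4 write tasks.
        TaskManagerState finishedReader = new TaskManagerState(0, 1);

        System.out.println(releasableIdleOnly(finishedReader));    // prints "true"
        System.out.println(releasableWhenDrained(finishedReader)); // prints "false"
    }
}
```

Under the first check, the finished read-side TM is released while a write task is still fetching from it, which matches the RemoteTransportException in the report; under the second, it stays alive until its data has been consumed.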
