flink-issues mailing list archives

From "Piotr Nowojski (Jira)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-14118) Reduce the unnecessary flushing when there is no data available for flush
Date Fri, 04 Oct 2019 07:26:00 GMT

    [ https://issues.apache.org/jira/browse/FLINK-14118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16944286#comment-16944286
] 

Piotr Nowojski commented on FLINK-14118:
----------------------------------------

There were some smaller changes, probably insignificant ones. Still, I wouldn't like to
risk introducing a critical bug/regression:
1. Given how fragile the network stack can be with respect to subtle bugs, and how poorly tested
our bug-fix releases are, I wouldn't back-port it.
2. If we merge it to the release-1.9 branch now, I'm pretty sure this improvement would be released
as part of the 1.9.x line way sooner than 1.10.
3. For me this is not necessarily a bug, but a new feature/improvement. Nico and I were aware
of this potential regression, but thought that the fix would bring even more harm -
apparently incorrectly.
4. Nobody has reported it for 2 years. Probably only a small fraction of users (high parallelism,
high throughput [no RocksDB, light records, etc...], high ratio of idling vs busy Tasks) can
experience it, and/or the regression was not visible to most users amid the general
low-latency improvements.



> Reduce the unnecessary flushing when there is no data available for flush
> -------------------------------------------------------------------------
>
>                 Key: FLINK-14118
>                 URL: https://issues.apache.org/jira/browse/FLINK-14118
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Network
>            Reporter: Yingjie Cao
>            Priority: Critical
>              Labels: pull-request-available
>             Fix For: 1.10.0
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> The new flush implementation, which works by triggering a netty user event, may cause a performance
regression compared to the old synchronization-based one. More specifically, when there
is exactly one BufferConsumer in the buffer queue of a subpartition and no new data will be
added for a while (maybe because there is simply no input, or because the operator's logic
is to collect some data for processing and not emit records immediately), that is, when there
is no data to send, the OutputFlusher will continuously notify that data is available and wake up
the netty thread, even though no data will be returned by the pollBuffer method.
> For some of our production jobs, this incurs 20% to 40% CPU overhead compared to
the old implementation. We tried to fix the problem by checking whether there is new data available
when flushing; if there is none, the netty thread is not notified. It works for
our jobs, and the CPU usage falls back to the previous level.
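
The idea described above (skip the wake-up when nothing new has been written since the last flush) can be illustrated with a minimal, self-contained sketch. This is not Flink's actual code: FlushSketch, ToySubpartition, the hasUnflushedData flag and the notifications counter are all hypothetical names invented for illustration, and the counter only stands in for waking up the netty thread.

```java
import java.util.ArrayDeque;
import java.util.Queue;

/** Hypothetical sketch of "notify only when there is new data"; not Flink's real classes. */
public class FlushSketch {

    /** Toy stand-in for a subpartition with a buffer queue. */
    static final class ToySubpartition {
        private final Queue<String> buffers = new ArrayDeque<>();
        // Assumption: a simple dirty flag tracks data added since the last flush.
        private boolean hasUnflushedData = false;
        // Stand-in for "wake up the netty thread"; counts how often we would notify.
        int notifications = 0;

        synchronized void add(String record) {
            buffers.add(record);
            hasUnflushedData = true;
        }

        /** Called periodically by a flusher thread in this toy model. */
        synchronized void flush() {
            if (!hasUnflushedData) {
                return; // nothing new since the last flush: skip the wake-up
            }
            hasUnflushedData = false;
            notifications++;
        }

        /** Returns the next buffered record, or null if the queue is empty. */
        synchronized String pollBuffer() {
            return buffers.poll();
        }
    }

    public static void main(String[] args) {
        ToySubpartition partition = new ToySubpartition();
        partition.add("record-1");
        // Five periodic flushes with no new input in between.
        for (int i = 0; i < 5; i++) {
            partition.flush();
        }
        System.out.println(partition.notifications); // prints 1
    }
}
```

Without the flag check, each of the five flushes in main would count as a notification even though only the first one has anything to send, which is the kind of idle wake-up the issue describes.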



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
