flink-issues mailing list archives

From "wangwj (Jira)" <j...@apache.org>
Subject [jira] [Comment Edited] (FLINK-10644) Batch Job: Speculative execution
Date Sat, 01 May 2021 14:46:00 GMT

    [ https://issues.apache.org/jira/browse/FLINK-10644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17332438#comment-17332438 ]

wangwj edited comment on FLINK-10644 at 5/1/21, 2:45 PM:
---------------------------------------------------------

[~trohrmann]
Hi, I have implemented this feature, and it has had a very significant effect in our production
cluster.
We are colleagues at Alibaba, so I will talk with you in detail on DingTalk.


was (Author: wangwj):
[~trohrmann]
Hi,I have implemented this feature, and it has a very significant effect in our product
cluster.
We are Alibaba's colleagues, I will talk with you in detail on DingTalk.

> Batch Job: Speculative execution
> --------------------------------
>
>                 Key: FLINK-10644
>                 URL: https://issues.apache.org/jira/browse/FLINK-10644
>             Project: Flink
>          Issue Type: New Feature
>          Components: Runtime / Coordination
>            Reporter: JIN SUN
>            Assignee: BoWang
>            Priority: Major
>              Labels: stale-assigned
>
> Stragglers (outliers) are tasks that run slower than most of the other tasks in a batch job.
They increase job latency, because a straggler usually sits on the critical path
of the job and becomes the bottleneck.
> Tasks may be slow for various reasons, including hardware degradation, software misconfiguration,
or noisy neighbors. It is hard for the JM to predict the runtime.
> To reduce the overhead of stragglers, other systems such as Hadoop/Tez and Spark have *_speculative
execution_*. Speculative execution is a health-check procedure that looks for tasks to
speculate, i.e. tasks running slower in an ExecutionJobVertex than the median of all successfully
completed tasks in that EJV. Such slow tasks are re-submitted to another TM. The slow task is not
stopped; a new copy runs in parallel, and once one of the copies completes, the others are
killed.
> This JIRA is an umbrella for applying this idea in FLINK. Details will be appended
later.
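
As a rough illustration of the health check described above, here is a minimal Java sketch. It is
not Flink code and not the implementation mentioned in the comment. The class SpeculationSketch,
the method names, and the slowdownFactor parameter are all hypothetical; the description only says
"slower than the median", so a factor of 1.0 would match it literally.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: none of these names exist in Flink.
public class SpeculationSketch {

    /** Median of the given durations in milliseconds; the input must be non-empty. */
    public static double median(long[] durations) {
        long[] sorted = durations.clone();
        Arrays.sort(sorted);
        int n = sorted.length;
        return n % 2 == 1 ? sorted[n / 2] : (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0;
    }

    /**
     * Returns the indices of running tasks whose elapsed time exceeds
     * slowdownFactor * median(durations of finished tasks in the same vertex).
     * slowdownFactor is an assumed tuning knob, not something the issue specifies.
     */
    public static List<Integer> findStragglers(
            long[] finishedMillis, long[] runningElapsedMillis, double slowdownFactor) {
        List<Integer> stragglers = new ArrayList<>();
        if (finishedMillis.length == 0) {
            return stragglers; // no baseline yet, so nothing can be judged slow
        }
        double threshold = slowdownFactor * median(finishedMillis);
        for (int i = 0; i < runningElapsedMillis.length; i++) {
            if (runningElapsedMillis[i] > threshold) {
                stragglers.add(i); // candidate for a speculative copy on another TM
            }
        }
        return stragglers;
    }

    public static void main(String[] args) {
        long[] finished = {100, 120, 110, 130}; // median is 115 ms
        long[] running = {90, 400, 150};        // elapsed time of still-running tasks
        System.out.println(findStragglers(finished, running, 1.5));
    }
}
```

The sketch only selects candidates. Launching the copy on another TM and killing the remaining
attempts once one finishes would be the scheduler's job and are left out here.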



--
This message was sent by Atlassian Jira
(v8.3.4#803005)