flink-issues mailing list archives

From "BoWang (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-12002) Adaptive Parallelism of Job Vertex Execution
Date Mon, 01 Apr 2019 08:06:00 GMT

    [ https://issues.apache.org/jira/browse/FLINK-12002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16806500#comment-16806500

BoWang commented on FLINK-12002:

Hi all, design doc is ready, any comments are appreciated: https://docs.google.com/document/d/1ZxnoJ3SOxUk1PL2xC1t-kepq28Pg20IL6eVKUUOWSKY/edit?usp=sharing

> Adaptive Parallelism of Job Vertex Execution
> --------------------------------------------
>                 Key: FLINK-12002
>                 URL: https://issues.apache.org/jira/browse/FLINK-12002
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Coordination
>            Reporter: ryantaocer
>            Assignee: BoWang
>            Priority: Major
> In Flink, the parallelism of a job is a pre-specified parameter. It is usually an empirical value, and thus may be optimal for neither performance nor resource usage, depending on the amount of data processed by each task.
> Furthermore, a fixed parallelism cannot scale to the varying data sizes common in production clusters, where configurations are not often changed.
> We propose to determine the job parallelism adaptively from the actual total input data size and an ideal data size to be processed by each task. The ideal size is pre-specified according to the properties of the operator, such as its preparation overhead relative to its data processing time.
> Our basic idea of "split and merge" is to dispatch data evenly across Reducers
> by splitting and/or merging the data buckets produced by Map. The data density skew
> problem is not covered. This kind of parallelism adjustment has no data correctness
> issue, since it does not break the condition that data with the same key is processed
> by a single task.
> We determine the proper parallelism of Reduce during scheduling, before it actually
> runs and after its input is ready (though not necessarily its total input data). In
> this context, adaptive parallelism is a better name. We think this scheduling
> improvement can benefit both batch and streaming, as long as we can obtain some
> clues about the input data.
>  Design doc: https://docs.google.com/document/d/1ZxnoJ3SOxUk1PL2xC1t-kepq28Pg20IL6eVKUUOWSKY/edit?usp=sharing
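The core of the proposal can be illustrated with a minimal sketch: derive the Reduce parallelism from the actual Map output size and an ideal per-task size, then merge whole buckets so that same-key data stays in one task. This is an illustrative reading of the idea, not the design-doc implementation; all function names, the bucket-size list, and the `max_parallelism` cap are hypothetical.

```python
import math

# Hypothetical sketch of adaptive parallelism; not Flink's actual API.
def adaptive_parallelism(bucket_sizes, ideal_size, max_parallelism=128):
    """Derive Reduce parallelism from the actual total Map output size."""
    total = sum(bucket_sizes)
    if total == 0:
        return 1
    return min(max_parallelism, max(1, math.ceil(total / ideal_size)))

def merge_buckets(bucket_sizes, ideal_size):
    """Greedily merge adjacent Map buckets until each reducer holds ~ideal_size.

    Buckets are kept whole, so all records with the same key remain in a
    single task, preserving correctness as the description requires.
    """
    assignments, current, current_size = [], [], 0
    for i, size in enumerate(bucket_sizes):
        current.append(i)
        current_size += size
        if current_size >= ideal_size:
            assignments.append(current)
            current, current_size = [], 0
    if current:
        assignments.append(current)
    return assignments
```

For example, with Map buckets of sizes [40, 10, 30, 20, 60] and an ideal size of 50, the total of 160 yields a parallelism of 4, and the merge step groups the buckets as [[0, 1], [2, 3], [4]]. Splitting an oversized bucket would additionally require sub-bucket boundaries from the Map side, which this sketch omits.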

This message was sent by Atlassian JIRA
