flink-issues mailing list archives

From "BoWang (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (FLINK-11309) Make SpillableSubpartition repeatably read to enable
Date Wed, 03 Apr 2019 12:31:00 GMT

     [ https://issues.apache.org/jira/browse/FLINK-11309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

BoWang updated FLINK-11309:
---------------------------
    Description: 
Hi all,

When running the batch WordCount example, I configured the job execution mode as *BATCH_FORCED*
and the failover-strategy as *region*, then manually injected errors to make the execution fail
in different phases. In some cases the job recovered from the failure and succeeded;
in others it retried several times and then failed.

Example:
 # If the failure occurred before the task read any data, e.g., before *invokable.invoke()* in
Task.java, failover succeeded.
 # If the failure occurred after the task had read data, failover did not work.

 

Problem diagnosis:

In the example described above, each ExecutionVertex is defined as a restart region,
and the ResultPartitionType between executions is *BLOCKING*. Thus, *SpillableSubpartition*
and *SpillableSubpartitionView* are used to write and read the shuffle data. Each data block is
represented as a *BufferConsumer* stored in a list called *buffers*. When a task requests input
data from *SpillableSubpartitionView*, the *BufferConsumer* is REMOVED from *buffers*. Thus, when
a failure occurs after data has been read, some *BufferConsumers* have already been released, so
even though the task is retried, its input data is incomplete.
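
The effect can be reproduced with a small stand-alone model. This is only a sketch of the behaviour described above, not Flink code: the class names *DestructiveSubpartition* and *DestructiveReadDemo*, the method *pollNext()*, and the use of strings in place of *BufferConsumer* are all assumptions made for illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;

/** Toy model (hypothetical, not Flink's class): reading consumes the stored buffers. */
class DestructiveSubpartition {
    private final ArrayDeque<String> buffers = new ArrayDeque<>();

    void add(String buffer) {
        buffers.add(buffer);
    }

    /** Returns the next buffer and removes it, or null when nothing is left. */
    String pollNext() {
        return buffers.poll();
    }
}

class DestructiveReadDemo {
    /** Reads at most maxBuffers buffers, as a single consumer attempt would. */
    static List<String> readAttempt(DestructiveSubpartition p, int maxBuffers) {
        List<String> seen = new ArrayList<>();
        String next;
        while (seen.size() < maxBuffers && (next = p.pollNext()) != null) {
            seen.add(next);
        }
        return seen;
    }

    public static void main(String[] args) {
        DestructiveSubpartition p = new DestructiveSubpartition();
        p.add("b0");
        p.add("b1");
        p.add("b2");

        // The first attempt "fails" after reading two buffers.
        List<String> first = readAttempt(p, 2);
        // The retry only sees what the first attempt left behind.
        List<String> retry = readAttempt(p, Integer.MAX_VALUE);

        System.out.println(first + " then " + retry); // prints "[b0, b1] then [b2]"
    }
}
```

The retried attempt never sees b0 and b1, which matches the incomplete input observed after failover.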

 

Fix Proposal:
 # *BufferConsumer* should not be removed from buffers until *ExecutionVertex* terminates.
 # *SpillableSubpartition* should not be released until *ExecutionVertex* terminates.
 # *SpillableSubpartition* could create multiple *SpillableSubpartitionViews*, each of which
corresponds to an *Execution*.

 Design doc: https://docs.google.com/document/d/1uXuJFiKODf241CKci3b0JnaF3zQ-Wt0V9wmC7kYwX-M/edit?usp=sharing
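
The three points of the proposal can be sketched with a stand-alone model. This is an illustration under stated assumptions, not the actual patch or Flink's API: *RepeatableSubpartition*, *View*, *createView()*, *release()*, and the string buffers are hypothetical names chosen for the example.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model (hypothetical): buffers stay in place until release() is called. */
class RepeatableSubpartition {
    private final List<String> buffers = new ArrayList<>();
    private boolean released;

    void add(String buffer) {
        buffers.add(buffer);
    }

    /** Each consumer attempt gets its own view with its own read position. */
    View createView() {
        if (released) {
            throw new IllegalStateException("subpartition already released");
        }
        return new View();
    }

    /** Assumed hook: called only once the consuming vertex has terminated. */
    void release() {
        released = true;
        buffers.clear();
    }

    class View {
        private int position;

        /** Returns the next buffer without removing it, or null at the end. */
        String pollNext() {
            return position < buffers.size() ? buffers.get(position++) : null;
        }
    }
}

class RepeatableReadDemo {
    /** Reads at most maxBuffers buffers through one view. */
    static List<String> readAttempt(RepeatableSubpartition.View view, int maxBuffers) {
        List<String> seen = new ArrayList<>();
        String next;
        while (seen.size() < maxBuffers && (next = view.pollNext()) != null) {
            seen.add(next);
        }
        return seen;
    }

    public static void main(String[] args) {
        RepeatableSubpartition p = new RepeatableSubpartition();
        p.add("b0");
        p.add("b1");
        p.add("b2");

        // The first attempt "fails" after two buffers; the retry uses a fresh view.
        List<String> first = readAttempt(p.createView(), 2);
        List<String> retry = readAttempt(p.createView(), Integer.MAX_VALUE);

        System.out.println(first + " then " + retry); // prints "[b0, b1] then [b0, b1, b2]"
        p.release();
    }
}
```

Because each view keeps its own position and nothing is removed on read, a retried attempt sees the complete input. The cost is that buffers are held until release() is called.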

> Make SpillableSubpartition repeatably read to enable
> ----------------------------------------------------
>
>                 Key: FLINK-11309
>                 URL: https://issues.apache.org/jira/browse/FLINK-11309
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Operators
>    Affects Versions: 1.6.2, 1.7.0, 1.7.1
>            Reporter: BoWang
>            Assignee: BoWang
>            Priority: Critical
>              Labels: pull-request-available
>          Time Spent: 50m
>  Remaining Estimate: 0h



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
