tez-dev mailing list archives

From "Kuhu Shukla (JIRA)" <j...@apache.org>
Subject [jira] [Resolved] (TEZ-3198) Shuffle failures for the trailing task in a vertex are often fatal to the entire DAG
Date Thu, 20 Sep 2018 14:20:00 GMT

     [ https://issues.apache.org/jira/browse/TEZ-3198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kuhu Shukla resolved TEZ-3198.
------------------------------
    Resolution: Duplicate

Yes. It will certainly allow the AM to retry the attempt sooner.

> Shuffle failures for the trailing task in a vertex are often fatal to the entire DAG
> ------------------------------------------------------------------------------------
>
>                 Key: TEZ-3198
>                 URL: https://issues.apache.org/jira/browse/TEZ-3198
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.7.0, 0.8.2
>            Reporter: Jason Lowe
>            Priority: Critical
>
> I've seen an increasing number of cases where a single-node failure caused the whole
> Tez DAG to fail. These scenarios share a common pattern: the last task of a vertex is
> attempting to complete its shuffle after all the peer tasks have already finished
> shuffling. The last task's attempt encounters errors shuffling one of its inputs and
> keeps reporting the failure to the AM. Eventually the attempt decides it must be the
> cause of the shuffle error and fails. The subsequent attempts all do the same thing,
> and eventually we hit the task max attempts limit and fail the vertex and DAG.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
