tez-dev mailing list archives

From "Jason Lowe (JIRA)" <j...@apache.org>
Subject [jira] [Created] (TEZ-3198) Shuffle failures for the trailing task in a vertex are often fatal to the entire DAG
Date Tue, 05 Apr 2016 19:21:25 GMT
Jason Lowe created TEZ-3198:
-------------------------------

             Summary: Shuffle failures for the trailing task in a vertex are often fatal to the entire DAG
                 Key: TEZ-3198
                 URL: https://issues.apache.org/jira/browse/TEZ-3198
             Project: Apache Tez
          Issue Type: Bug
    Affects Versions: 0.8.2, 0.7.0
            Reporter: Jason Lowe
            Priority: Critical
             Fix For: 0.7.1, 0.8.3


I've seen an increasing number of cases where a single-node failure caused the whole Tez DAG
to fail. What these scenarios have in common is that the last task of a vertex is attempting
to complete a shuffle after all of its peer tasks have already finished shuffling.  The last
task's attempt encounters errors shuffling one of its inputs and keeps reporting them to the
AM.  Eventually the attempt decides that it must itself be the cause of the shuffle errors and
fails.  The subsequent attempts all do the same thing, until the task hits its max attempts
limit and the vertex and DAG fail.
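
To make the failure sequence above concrete, here is a toy model of it in Python. This is
only a sketch and not Tez code: the function name, the parameters, and in particular
am_rerun_threshold (a hypothetical number of reports after which the AM would re-run the
upstream task) are assumptions made for illustration. The report itself only says that
the attempts keep reporting, blame themselves, and exhaust the task's max attempts.

```python
def simulate_trailing_task(max_attempts, failures_before_self_blame,
                           am_rerun_threshold=None):
    """Toy model of the trailing-task shuffle failure. Not Tez code.

    max_attempts: task attempt limit before the vertex (and DAG) fails.
    failures_before_self_blame: fetch errors an attempt tolerates before it
        decides it is itself the problem and fails (hypothetical knob).
    am_rerun_threshold: hypothetical number of reports after which the AM
        re-runs the upstream task; None models an AM that never reacts.

    Returns (dag_failed, attempts_used).
    """
    reports_seen_by_am = 0
    for attempt in range(1, max_attempts + 1):
        for _ in range(failures_before_self_blame):
            # The attempt fails to fetch one input and reports it to the AM.
            reports_seen_by_am += 1
            if (am_rerun_threshold is not None
                    and reports_seen_by_am >= am_rerun_threshold):
                # Assumed recovery path: the upstream output is regenerated,
                # so this attempt can finish its shuffle.
                return False, attempt
        # No recovery happened: the attempt blames itself and fails,
        # and the next attempt starts the same cycle.
    # Every attempt failed the same way, so the attempt limit is hit.
    return True, max_attempts


if __name__ == "__main__":
    print(simulate_trailing_task(4, 10))
    print(simulate_trailing_task(4, 10, am_rerun_threshold=15))
```

With no reaction from the AM, every attempt repeats the same cycle and the DAG fails once
the attempt limit is reached. In the second call the assumed threshold is crossed during
the second attempt, so the DAG survives in this model.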




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
