tez-dev mailing list archives

From "Bikas Saha (JIRA)" <j...@apache.org>
Subject [jira] [Created] (TEZ-3075) Revamp bad node handling
Date Tue, 26 Jan 2016 18:05:39 GMT
Bikas Saha created TEZ-3075:

             Summary: Revamp bad node handling
                 Key: TEZ-3075
                 URL: https://issues.apache.org/jira/browse/TEZ-3075
             Project: Apache Tez
          Issue Type: Improvement
            Reporter: Bikas Saha

The current bad node handling logic is derived from MR (MapReduce) and does not work in all cases.
Things to consider:
1) Have a notion of probation, where machines are taken out of service for a period of time (say
5, 15 and 30 minutes) before being given up on for good. This allows more graceful handling of
temporary problems.
2) Different handling for YARN marking a node as bad vs. internal heuristics.
3) Bad nodes should not immediately trigger re-execution of completed work. Re-execution should
be based on the presence of downstream consumers (i.e. existing demand for that output) and a
reasonable indication from consumers reading from that node that it cannot serve results (e.g.
multiple reports of read errors with that node as a source).
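
A rough sketch of how points 1 and 3 might fit together is given below. It is only an
illustration, not existing Tez code: the class BadNodePolicy, its method names and the
read-error threshold of 3 are all hypothetical. The probation periods are the 5, 15 and 30
minute values suggested above. Point 2 (YARN vs. internal heuristics) is left out for brevity.

```java
import java.time.Duration;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of the proposal; not part of the Tez API. */
public class BadNodePolicy {
    // Probation periods suggested in the issue; a further failure gives the node up for good.
    private static final Duration[] PROBATION = {
        Duration.ofMinutes(5), Duration.ofMinutes(15), Duration.ofMinutes(30)
    };

    // Assumed value for "multiple reports of read errors"; the issue does not name a number.
    private static final int READ_ERROR_THRESHOLD = 3;

    private final Map<String, Integer> strikes = new HashMap<>();
    private final Map<String, Long> probationEndMillis = new HashMap<>();

    /**
     * Records a failure on a node and puts it on the next probation period.
     * Returns the probation length, or null once the node is given up for good.
     */
    public Duration recordFailure(String node, long nowMillis) {
        int strike = strikes.merge(node, 1, Integer::sum);
        if (strike > PROBATION.length) {
            probationEndMillis.put(node, Long.MAX_VALUE); // never usable again
            return null;
        }
        Duration period = PROBATION[strike - 1];
        probationEndMillis.put(node, nowMillis + period.toMillis());
        return period;
    }

    /** A node is usable if it never failed or its probation period has elapsed. */
    public boolean isUsable(String node, long nowMillis) {
        Long end = probationEndMillis.get(node);
        return end == null || nowMillis >= end;
    }

    /**
     * Completed work on a bad node is re-executed only if some downstream consumer
     * still needs the output and enough consumers reported read errors against it.
     */
    public static boolean shouldReexecute(int pendingConsumers, int readErrorReports) {
        return pendingConsumers > 0 && readErrorReports >= READ_ERROR_THRESHOLD;
    }
}
```

With this sketch, three successive calls to recordFailure for the same node return 5, 15 and
30 minutes, and a fourth returns null. A node with completed output but no pending consumers
never triggers re-execution, however many read errors are reported against it.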
