tez-dev mailing list archives

From "Bikas Saha (JIRA)" <j...@apache.org>
Subject [jira] [Created] (TEZ-3164) Surface error histograms from the AM
Date Mon, 14 Mar 2016 19:30:33 GMT
Bikas Saha created TEZ-3164:

             Summary: Surface error histograms from the AM
                 Key: TEZ-3164
                 URL: https://issues.apache.org/jira/browse/TEZ-3164
             Project: Apache Tez
          Issue Type: Improvement
            Reporter: Bikas Saha

Job tasks are constantly probing the cluster, so when there are issues in the cluster, jobs are
the first to notice them. If we can surface these observations to the user, we can identify
cluster issues quickly.

Let's say a set of bad machines is added to the cluster and tasks start seeing shuffle errors
from those machines. This can slow down or hang the job. If the AM can surface increased error
counts per source and destination machine, that could pinpoint the bad machines, versus
having to arrive at those machines from first principles and log searching.
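
As a rough illustration of the idea, here is a minimal sketch of a per-host error histogram
that the AM could maintain. Everything in it is an assumption made for illustration: the class
ErrorHistogram, its methods, and the threshold-based suspectSources query are hypothetical
and are not existing Tez APIs.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: none of these names come from the Tez codebase.
final class ErrorHistogram {
    // Error counts keyed by the host the data was being fetched from.
    private final Map<String, Integer> bySource = new HashMap<>();
    // Error counts keyed by the host that reported the failed fetch.
    private final Map<String, Integer> byDestination = new HashMap<>();

    // Record one shuffle error: a task on destHost failed to read from srcHost.
    void recordShuffleError(String srcHost, String destHost) {
        bySource.merge(srcHost, 1, Integer::sum);
        byDestination.merge(destHost, 1, Integer::sum);
    }

    int sourceErrors(String host) {
        return bySource.getOrDefault(host, 0);
    }

    int destinationErrors(String host) {
        return byDestination.getOrDefault(host, 0);
    }

    // Hosts blamed as the source at least `threshold` times, worst first.
    // Ties are broken by host name so the order is deterministic.
    List<String> suspectSources(int threshold) {
        List<String> hosts = new ArrayList<>();
        for (Map.Entry<String, Integer> e : bySource.entrySet()) {
            if (e.getValue() >= threshold) {
                hosts.add(e.getKey());
            }
        }
        hosts.sort((a, b) -> {
            int byCount = Integer.compare(bySource.get(b), bySource.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        return hosts;
    }
}
```

Keeping separate source and destination counts matters for the scenario above: a bad machine
serving shuffle data shows up as one source with errors reported by many destinations, while a
machine that cannot fetch from anyone shows up as one destination with errors spread across
many sources.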

