spark-issues mailing list archives

From "Sital Kedia (JIRA)" <j...@apache.org>
Subject [jira] [Created] (SPARK-19753) Remove all shuffle files on a host in case of slave lost or fetch failure
Date Mon, 27 Feb 2017 20:31:45 GMT
Sital Kedia created SPARK-19753:
-----------------------------------

             Summary: Remove all shuffle files on a host in case of slave lost or fetch failure
                 Key: SPARK-19753
                 URL: https://issues.apache.org/jira/browse/SPARK-19753
             Project: Spark
          Issue Type: Bug
          Components: Scheduler
    Affects Versions: 2.0.1
            Reporter: Sital Kedia


Currently, when we detect a fetch failure, we only remove the shuffle files produced by the
executor that failed, even though the host itself might be down and none of its shuffle files
accessible. When we are running multiple executors on a host, a host going down currently
results in multiple fetch failures and multiple retries of the stage, which is very inefficient.
If we remove all the shuffle files on that host on the first fetch failure, we can rerun all
the tasks that ran on that host in a single stage retry.
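
To illustrate the difference, here is a minimal toy model in Python. It is not Spark's
scheduler code: MapOutput, ToyOutputTracker, and count_stage_retries are hypothetical names
invented for this sketch, and it assumes each stage retry surfaces exactly one fetch failure
and reruns the lost map tasks on a healthy host.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MapOutput:
    map_id: int
    executor_id: str
    host: str


class ToyOutputTracker:
    """Hypothetical stand-in for the scheduler's record of map outputs."""

    def __init__(self, outputs):
        self.outputs = {o.map_id: o for o in outputs}

    def remove_outputs_on_executor(self, executor_id):
        # Current behaviour described in the issue: forget one executor's outputs.
        lost = sorted(m for m, o in self.outputs.items() if o.executor_id == executor_id)
        for m in lost:
            del self.outputs[m]
        return lost

    def remove_outputs_on_host(self, host):
        # Proposed behaviour: forget every output that lives on the host.
        lost = sorted(m for m, o in self.outputs.items() if o.host == host)
        for m in lost:
            del self.outputs[m]
        return lost


def count_stage_retries(outputs, dead_host, per_host):
    """Count stage retries until no registered output points at dead_host."""
    tracker = ToyOutputTracker(outputs)
    retries = 0
    while True:
        # Assumption: a reducer hits one unreachable output per stage attempt.
        bad = next((o for o in tracker.outputs.values() if o.host == dead_host), None)
        if bad is None:
            return retries
        if per_host:
            lost = tracker.remove_outputs_on_host(bad.host)
        else:
            lost = tracker.remove_outputs_on_executor(bad.executor_id)
        # Rerun the lost map tasks somewhere healthy and register the new outputs.
        for m in lost:
            tracker.outputs[m] = MapOutput(m, f"exec-new-{m}", "healthy-host")
        retries += 1


# Three executors on hostA with two map outputs each, plus one output on hostB.
outputs = [MapOutput(i, f"exec-{i % 3}", "hostA") for i in range(6)]
outputs.append(MapOutput(6, "exec-3", "hostB"))

print(count_stage_retries(outputs, "hostA", per_host=False))  # 3
print(count_stage_retries(outputs, "hostA", per_host=True))   # 1
```

In this model, per-executor removal needs one retry per executor that was on the dead host,
while per-host removal needs a single retry regardless of how many executors the host ran.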




