spark-issues mailing list archives

From "Jianshi Huang (JIRA)" <j...@apache.org>
Subject [jira] [Created] (SPARK-6353) Handling fatal errors of executors and decommission datanodes
Date Mon, 16 Mar 2015 09:40:38 GMT
Jianshi Huang created SPARK-6353:
------------------------------------

             Summary: Handling fatal errors of executors and decommission datanodes
                 Key: SPARK-6353
                 URL: https://issues.apache.org/jira/browse/SPARK-6353
             Project: Spark
          Issue Type: Improvement
          Components: Spark Core, YARN
            Reporter: Jianshi Huang


We've been seeing "No space left on device" errors from time to time lately, and the job
fails after retries. Obviously, in such cases, retrying won't help.

Sure, the problem is in the datanodes, but I'm wondering whether the Spark driver could
handle it and decommission the problematic datanode before retrying, and maybe dynamically
allocate another datanode if dynamic allocation is enabled.

I think there needs to be a class of fatal errors that can't be recovered from by retrying,
and it would be best if Spark handled them gracefully. A rough sketch of the idea follows.
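
As a sketch only (Spark exposes no such hooks today; the names FatalErrorPolicy,
excludeNode, and requestReplacement are made up for illustration), a driver-side
policy in Scala could look like this:

import java.io.IOException

// Hypothetical sketch: the object and method names below are invented
// to illustrate the proposal, not existing Spark APIs.
object FatalErrorPolicy {

  /** Errors that retries cannot fix, e.g. a full disk on the node. */
  def isFatal(e: Throwable): Boolean = e match {
    case io: IOException =>
      Option(io.getMessage).exists(_.contains("No space left on device"))
    case _ => false
  }

  // Stand-ins for the driver-side actions the proposal asks for.
  def excludeNode(host: String): Unit =
    println(s"excluding $host from further scheduling")

  def requestReplacement(): Unit =
    println("requesting a replacement executor from YARN")

  /**
   * On task failure: skip the useless retries for fatal errors,
   * decommission the node, and optionally ask for a replacement.
   */
  def handleFailure(host: String, e: Throwable, dynAlloc: Boolean): Unit =
    if (isFatal(e)) {
      excludeNode(host)
      if (dynAlloc) requestReplacement()
    } else {
      println("transient failure; normal retry path")
    }
}

// Example: a disk-full error is classified as fatal, so the node is
// excluded and a replacement is requested instead of retrying.
object Demo extends App {
  FatalErrorPolicy.handleFailure(
    host = "datanode-07",
    e = new IOException("No space left on device"),
    dynAlloc = true)
}

The point is just the split: classify the error first, then decommission and replace
instead of burning through the retry budget.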

Jianshi



