spark-issues mailing list archives

From "Matt Cheah (JIRA)" <j...@apache.org>
Subject [jira] [Created] (SPARK-24135) [K8s] Executors that fail to start up because of init-container errors are not retried and limit the executor pool size
Date Tue, 01 May 2018 15:21:00 GMT
Matt Cheah created SPARK-24135:
----------------------------------

             Summary: [K8s] Executors that fail to start up because of init-container errors
are not retried and limit the executor pool size
                 Key: SPARK-24135
                 URL: https://issues.apache.org/jira/browse/SPARK-24135
             Project: Spark
          Issue Type: Bug
          Components: Kubernetes
    Affects Versions: 2.3.0
            Reporter: Matt Cheah


In KubernetesClusterSchedulerBackend, we detect when executors disconnect after having started, or when executor pods hit the {{ERROR}} or {{DELETED}} states. When executors fail in these ways, they are removed from the pending executors pool and the driver retries requesting these executors.

However, the driver does not handle a different class of failure: the pod entering the {{Init:Error}} state. This state occurs when the executor pod fails to launch because one of its init-containers fails. Spark itself does not attach any init-containers to the executors, but custom admission webhooks running on the cluster can inject init-containers into the executor pods, and pod presets can likewise specify init-containers to run on them. Spark should therefore handle the {{Init:Error}} case regardless of whether Spark itself is aware of any init-containers.
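As an illustrative sketch (not Spark's actual code, which uses the fabric8 Kubernetes client in Scala), the condition kubectl summarizes as {{Init:Error}} can be recognized from the pod status: the pod phase stays {{Pending}} while an init container reports a terminated state with a non-zero exit code. The dict shapes below mirror the Kubernetes `pod.status` fields; the helper name is hypothetical.

```python
def init_container_failed(pod_status):
    """Return True if any init container terminated with a non-zero exit code.

    This roughly corresponds to the 'Init:Error' state kubectl displays:
    the pod phase remains 'Pending' while an init container's
    state.terminated.exitCode is non-zero.
    """
    for status in pod_status.get("initContainerStatuses", []):
        terminated = status.get("state", {}).get("terminated")
        if terminated and terminated.get("exitCode", 0) != 0:
            return True
    return False

# A pod whose (webhook-injected) init container exited with code 1.
failed_pod = {
    "phase": "Pending",
    "initContainerStatuses": [
        {"name": "injected-init", "state": {"terminated": {"exitCode": 1}}},
    ],
}
# A healthy executor pod with no init containers.
healthy_pod = {"phase": "Running", "initContainerStatuses": []}
```

Because this check only reads init-container statuses, it works whether the init-containers were attached by Spark, by a webhook, or by a pod preset.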

This class of failure is particularly bad because when a pod hits this state, the failed executor will never start, yet it is still counted as pending by the executor allocator. The allocator will not request further rounds of executors because its current batch has not resolved to either running or failed. We therefore end up stuck with however many executors successfully started before the faulty one failed to start, potentially creating an artificial resource bottleneck.
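A simplified model of the allocator (illustrative only; the class and field names are hypothetical, not Spark's) shows why a single never-resolving pod stalls allocation: new batches are requested only once the previous batch has fully resolved.

```python
class BatchAllocator:
    """Toy model of batch-based executor allocation."""

    def __init__(self, target, batch_size):
        self.target = target        # total executors we want running
        self.batch_size = batch_size
        self.running = 0
        self.pending = 0            # requested but not yet resolved

    def maybe_request_batch(self):
        """Request a new batch only if the previous batch fully resolved."""
        if self.pending == 0 and self.running < self.target:
            batch = min(self.batch_size, self.target - self.running)
            self.pending += batch
            return batch
        return 0  # stuck if a pending pod never resolves (e.g. Init:Error)

    def pod_running(self):
        self.pending -= 1
        self.running += 1

alloc = BatchAllocator(target=10, batch_size=5)
alloc.maybe_request_batch()   # first batch of 5 is requested
for _ in range(4):
    alloc.pod_running()       # 4 pods start; 1 is stuck in Init:Error
```

At this point `pending` stays at 1 forever, so `maybe_request_batch()` returns 0 on every call and the application is capped at 4 of its 10 requested executors.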



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

