spark-issues mailing list archives

From "Yinan Li (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (SPARK-24135) [K8s] Executors that fail to start up because of init-container errors are not retried and limit the executor pool size
Date Tue, 01 May 2018 19:54:00 GMT

    [ https://issues.apache.org/jira/browse/SPARK-24135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16460066#comment-16460066 ]

Yinan Li edited comment on SPARK-24135 at 5/1/18 7:53 PM:
----------------------------------------------------------

I agree that we should add detection for initialization errors, but I'm not sure that requesting
new executors to replace the ones that failed initialization is a good idea. External webhooks
and initializers are typically installed by cluster admins, and there is always a risk that a bug
in a webhook or initializer causes pods to fail initialization. In the case of initializers,
things are worse: pods will never get out of the pending status if, for whatever reason,
the controller handling a particular initializer is down. For the reasons
[~mcheah] mentioned above, it's not obvious whether initialization errors should count towards
job failures. I think keeping track of how many initialization errors have been seen and stopping
requests for new executors after a certain threshold might be a good idea.
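The threshold idea above could be sketched roughly as follows. This is an illustrative
sketch only, not code from the Spark codebase: the class and method names
(`InitFailureTracker`, `record_init_failure`, `should_request_new_executor`) and the
default threshold are all hypothetical.

```python
class InitFailureTracker:
    """Hypothetical sketch: count executor init failures and stop requesting
    replacement executors once a threshold is crossed, on the assumption that
    repeated init errors indicate a cluster-level problem (e.g. a broken
    webhook or a down initializer controller) rather than a transient one."""

    def __init__(self, max_init_failures: int = 5):
        self.max_init_failures = max_init_failures
        self.init_failures = 0

    def record_init_failure(self) -> None:
        # Called each time an executor pod is observed failing initialization.
        self.init_failures += 1

    def should_request_new_executor(self) -> bool:
        # Keep replacing failed executors only while we are under the threshold.
        return self.init_failures < self.max_init_failures
```

The point of the threshold is to avoid an unbounded request loop when every new
executor pod is doomed to fail initialization for the same external reason.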


> [K8s] Executors that fail to start up because of init-container errors are not retried and limit the executor pool size
> -----------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-24135
>                 URL: https://issues.apache.org/jira/browse/SPARK-24135
>             Project: Spark
>          Issue Type: Bug
>          Components: Kubernetes
>    Affects Versions: 2.3.0
>            Reporter: Matt Cheah
>            Priority: Major
>
> In KubernetesClusterSchedulerBackend, we detect executors that disconnect after having
started and executors that hit the {{ERROR}} or {{DELETED}} states. When executors fail
in these ways, they are removed from the pending executor pool and the driver retries
requesting them.
> However, the driver does not handle a different class of error: when the pod enters the
{{Init:Error}} state. This state comes up when the executor fails to launch because one of
its init-containers fails. Spark itself doesn't attach any init-containers to the executors.
However, custom webhooks can run on the cluster and attach init-containers to the executor
pods, and pod presets can also specify init-containers to run on these pods. Therefore
Spark should handle the {{Init:Error}} case regardless of whether Spark itself is aware of
init-containers.
> This class of error is particularly bad because when we hit this state, the failed executor
will never start, but it is still seen as pending by the executor allocator. The executor allocator
won't request more rounds of executors because its current batch hasn't been resolved to either
running or failed. We therefore end up stuck with the number of executors that
successfully started before the faulty one failed, potentially creating an artificial resource
bottleneck.
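The detection the description asks for comes down to inspecting the
`initContainerStatuses` field of a pod's `status` object as returned by the Kubernetes
API; `Init:Error` is how kubectl renders a pod whose init-container terminated with a
nonzero exit code. A minimal sketch of that check, with a hypothetical helper name and a
plain-dict pod status standing in for the API response:

```python
def has_init_container_error(pod_status: dict) -> bool:
    """Return True if any init-container of the pod terminated with a nonzero
    exit code. `pod_status` is the `status` object of a Pod as returned by the
    Kubernetes API; kubectl displays this condition as `Init:Error`.
    (Helper name is hypothetical, not from the Spark codebase.)"""
    for cs in pod_status.get("initContainerStatuses", []):
        terminated = cs.get("state", {}).get("terminated")
        if terminated and terminated.get("exitCode", 0) != 0:
            return True
    return False
```

A scheduler backend watching pod events could run this check on each status update and
treat a hit the same way it treats the {{ERROR}} and {{DELETED}} states, so the pod stops
counting against the pending batch.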



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

