spark-issues mailing list archives

From "Anirudh Ramanathan (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (SPARK-24135) [K8s] Executors that fail to start up because of init-container errors are not retried and limit the executor pool size
Date Thu, 03 May 2018 08:59:00 GMT

    [ https://issues.apache.org/jira/browse/SPARK-24135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16462139#comment-16462139 ]

Anirudh Ramanathan edited comment on SPARK-24135 at 5/3/18 8:58 AM:
--------------------------------------------------------------------

It is increasingly common for people to write custom controllers and custom resources rather
than use the built-in controllers, especially when the workloads have special characteristics.
This is the whole reason people are working on tooling like the [operator framework|https://coreos.com/blog/introducing-operator-framework].
I don't think the future lies in shoehorning applications into the existing controllers.
The existing controllers are a good starting point, but for any custom orchestration the
recommendation from the k8s community at large would be to write an operator, which in some
sense is what we've done. So I don't think moving towards the built-in controllers gives us
anything more.


Also, replication controllers and deployments are not used for applications with termination
semantics; they're suited to long-running services. The only "batch"-type built-in controller
is the [job controller|https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion],
which does implement a [backoff policy|https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion/#pod-backoff-failure-policy]
covering both initialization and runtime errors in containers. As I see it, we should have
safe limits for all kinds of failures so that we eventually give up; it's more a question of
whether this class of failure should be treated differently, as a framework error.
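
To make the "safe limit" idea concrete, here's a minimal sketch of what such a cap could look
like on the driver side. This is purely illustrative; the class and the notion of a configurable
maximum are hypothetical, not an existing Spark setting or implementation.

{code:scala}
// Illustrative only: a hypothetical cap on executor launch failures of any kind
// (init-container errors, runtime errors, ...), analogous to a Job's backoffLimit.
class ExecutorFailureTracker(maxFailures: Int) {
  private var failures = 0

  // Called whenever an executor pod fails, whatever the cause.
  def recordFailure(): Unit = failures += 1

  // Once the limit is hit, the backend would stop requesting replacements
  // and fail the application instead of retrying forever.
  def shouldGiveUp: Boolean = failures >= maxFailures
}
{code}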

Also, flakiness due to admission webhooks seems like something that should be handled by
retries in the init container, or by some other automation, since it's outside Spark land.
That makes me apprehensive about handling such specific cases within Spark, rather than
dealing with failures at the coarser level of "framework error" versus "app error".


> [K8s] Executors that fail to start up because of init-container errors are not retried and limit the executor pool size
> -----------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-24135
>                 URL: https://issues.apache.org/jira/browse/SPARK-24135
>             Project: Spark
>          Issue Type: Bug
>          Components: Kubernetes
>    Affects Versions: 2.3.0
>            Reporter: Matt Cheah
>            Priority: Major
>
> In KubernetesClusterSchedulerBackend, we detect if executors disconnect after having been started or if executors hit the {{ERROR}} or {{DELETED}} states. When executors fail in these ways, they are removed from the pending executor pool and the driver should retry requesting these executors.
> However, the driver does not handle a different class of error: when the pod enters the {{Init:Error}} state. This state comes up when the executor fails to launch because one of its init-containers fails. Spark itself doesn't attach any init-containers to the executors. However, custom webhooks can run on the cluster and attach init-containers to the executor pods. Additionally, pod presets can specify init-containers to run on these pods. Therefore Spark should handle the {{Init:Error}} cases regardless of whether Spark itself is aware of init-containers or not.
> This class of error is particularly bad because when we hit this state, the failed executor will never start, but it's still seen as pending by the executor allocator. The executor allocator won't request more rounds of executors because its current batch hasn't been resolved to either running or failed. Therefore we end up stuck with the number of executors that successfully started before the faulty one failed to start, potentially creating a fake resource bottleneck.
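
A minimal sketch of how a pod watch handler could detect this case, assuming the fabric8
Kubernetes client model that the Kubernetes backend already uses; the helper below is
illustrative, not the actual KubernetesClusterSchedulerBackend code:

{code:scala}
import io.fabric8.kubernetes.api.model.Pod
import scala.collection.JavaConverters._

// Returns true when any init container of the pod terminated with a non-zero
// exit code, which is the situation kubectl renders as "Init:Error".
def initContainerFailed(pod: Pod): Boolean = {
  val initStatuses = Option(pod.getStatus)
    .flatMap(s => Option(s.getInitContainerStatuses))
    .map(_.asScala)
    .getOrElse(Seq.empty)
  initStatuses.exists { status =>
    Option(status.getState)
      .flatMap(s => Option(s.getTerminated))
      .exists(t => Option(t.getExitCode).exists(_.intValue != 0))
  }
}

// A watch handler could treat such pods like ERROR/DELETED ones: drop them from
// the pending set so the allocator requests a replacement in the next round.
{code}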


