[ https://issues.apache.org/jira/browse/SPARK-23980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Apache Spark reassigned SPARK-23980:
------------------------------------
Assignee: (was: Apache Spark)
> Resilient Spark driver on Kubernetes
> ------------------------------------
>
> Key: SPARK-23980
> URL: https://issues.apache.org/jira/browse/SPARK-23980
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 2.3.0
> Reporter: Sebastian Toader
> Priority: Major
>
> The current implementation of the `Spark driver` on Kubernetes is not resilient to node failures,
because the driver runs as a bare `Pod`. If a node fails, Kubernetes terminates the pods
that were running on that node and does not reschedule them onto any of the other
nodes of the cluster.
> If the `driver` is instead implemented as a Kubernetes [Job|https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion/],
then it will be rescheduled onto another node.
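As a sketch of why a Job behaves differently (all field values here are hypothetical, not from the linked PR): a Job adds a controller that notices when its pod disappears after a node failure and creates a replacement pod on a healthy node, whereas a bare pod is simply gone.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: spark-driver            # hypothetical name
spec:
  backoffLimit: 3               # Job controller re-creates the driver pod up to 3 times
  template:
    spec:
      restartPolicy: Never      # restarts are handled by the Job controller, not the kubelet
      containers:
        - name: spark-kubernetes-driver
          image: spark:2.3.0    # hypothetical driver image
          args: ["driver"]
```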
> When the driver is terminated, its executors (which may run on other nodes) are cleaned up
by Kubernetes with some delay via [Kubernetes garbage collection|https://kubernetes.io/docs/concepts/workloads/controllers/garbage-collection/].
> This can lead to concurrency issues where a re-spawned `driver` tries to create
new executors with the same names as executors that are still in the middle of being cleaned up
by Kubernetes garbage collection.
> To solve this issue, executor names must be made unique for each `driver` *instance*.
> The PR linked to this JIRA implements the above: it creates the Spark driver
as a Job and ensures that executor pod names are unique per driver instance.
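A minimal sketch of the unique-naming idea (the naming scheme and identifiers here are illustrative, not the PR's actual code): each driver instance generates a random suffix at startup and embeds it in every executor pod name, so a re-spawned driver cannot collide with pods of the previous instance that garbage collection has not yet removed.

```python
import uuid

def executor_pod_name(app_name: str, executor_id: int, driver_instance_id: str) -> str:
    # Hypothetical scheme: include a per-driver-instance suffix so executors of a
    # re-spawned driver never share a name with pods of the previous instance.
    return f"{app_name}-{driver_instance_id}-exec-{executor_id}"

# Each driver instance generates its own short random suffix at startup.
instance_a = uuid.uuid4().hex[:8]
instance_b = uuid.uuid4().hex[:8]

names_a = {executor_pod_name("spark-pi", i, instance_a) for i in range(1, 4)}
names_b = {executor_pod_name("spark-pi", i, instance_b) for i in range(1, 4)}

# Even with identical executor ids, the two instances' name sets are disjoint.
assert names_a.isdisjoint(names_b)
```

With plain `app_name + executor_id` naming, the second driver instance would try to create `spark-pi-exec-1` while the old `spark-pi-exec-1` pod is still being deleted; the suffix removes that race.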
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org