spark-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Apache Spark (JIRA)" <>
Subject [jira] [Commented] (SPARK-23980) Resilient Spark driver on Kubernetes
Date Fri, 13 Apr 2018 14:23:00 GMT


Apache Spark commented on SPARK-23980:

User 'baluchicken' has created a pull request for this issue:

> Resilient Spark driver on Kubernetes
> ------------------------------------
>                 Key: SPARK-23980
>                 URL:
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 2.3.0
>            Reporter: Sebastian Toader
>            Priority: Major
> The current implementation of `Spark driver` on Kubernetes is not resilient to node failures
as it’s implemented as a `Pod`. In case of a node failure Kubernetes terminates the pods
that were running on that node. Kubernetes doesn't reschedule these pods to any of the other
nodes of the cluster.
> If the `driver` is implemented as Kubernetes [Job|]
than it will be rescheduled to other node.
> When the driver is terminated its executors (that may run on other nodes) are terminated
by Kubernetes with some delay by [Kubernetes Garbage collection|].
> This can lead to concurrency issues where the re-spawned `driver` was trying to create
new executors with same name as the executors being in the middle of being cleaned up by Kubernetes
garbage collection.
> To solve this issue the executor name must be made unique for each `driver` *instance*.
> The PR linked to this lira is an implementation of the above that creates spark driver
as a Job and ensures that executor pod names are unique per driver instance.

This message was sent by Atlassian JIRA

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message