spark-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Sebastian Toader (JIRA)" <>
Subject [jira] [Created] (SPARK-23980) Resilient Spark driver on Kubernetes
Date Fri, 13 Apr 2018 13:52:00 GMT
Sebastian Toader created SPARK-23980:

             Summary: Resilient Spark driver on Kubernetes
                 Key: SPARK-23980
             Project: Spark
          Issue Type: Improvement
          Components: Spark Core
    Affects Versions: 2.3.0
            Reporter: Sebastian Toader

The current implementation of `Spark driver` on Kubernetes is not resilient to node failures
as it’s implemented as a `Pod`. In case of a node failure Kubernetes terminates the pods
that were running on that node. Kubernetes doesn't reschedule these pods to any of the other
nodes of the cluster.

If the `driver` is implemented as Kubernetes [Job|]
than it will be rescheduled to other node.

When the driver is terminated its executors (that may run on other nodes) are terminated by
Kubernetes with some delay by [Kubernetes Garbage collection|].

This can lead to concurrency issues where the re-spawned `driver` was trying to create new
executors with same name as the executors being in the middle of being cleaned up by Kubernetes
garbage collection.

To solve this issue the executor name must be made unique for each `driver` *instance*.

The PR linked to this lira is an implementation of the above that creates spark driver as
a Job and ensures that executor pod names are unique per driver instance.

This message was sent by Atlassian JIRA

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message