spark-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Erik Erlandson <>
Subject Re: SPIP: Spark on Kubernetes
Date Tue, 15 Aug 2017 15:43:02 GMT
+1 (non-binding)

On Tue, Aug 15, 2017 at 8:32 AM, Anirudh Ramanathan <>

> Spark on Kubernetes effort has been developed separately in a fork, and
> linked back from the Apache Spark project as an experimental backend
> <>.
> We're ~6 months in, have had 5 releases
> <>.
>    - 2 Spark versions maintained (2.1, and 2.2)
>    - Extensive integration testing and refactoring efforts to maintain
>    code quality
>    - Developer
>    <> and
>    user-facing <>
>    documentation
>    - 10+ consistent code contributors from different organizations
>    <>
>    in actively maintaining and using the project, with several more members
>    involved in testing and providing feedback.
>    - The community has delivered several talks on Spark-on-Kubernetes
>    generating lots of feedback from users.
>    - In addition to these, we've seen efforts spawn off such as:
>    - HDFS on Kubernetes
>       <> with
>       Locality and Performance Experiments
>       - Kerberized access
>       <>
>       HDFS from Spark running on Kubernetes
> *Following the SPIP process, I'm putting this SPIP up for a vote.*
>    - +1: Yeah, let's go forward and implement the SPIP.
>    - +0: Don't really care.
>    - -1: I don't think this is a good idea because of the following
>    technical reasons.
> If there is any further clarification desired, on the design or the
> implementation, please feel free to ask questions or provide feedback.
> SPIP: Kubernetes as A Native Cluster Manager
> Full Design Doc: link
> <>
> Kubernetes Issue:
> Authors: Yinan Li, Anirudh Ramanathan, Erik Erlandson, Andrew Ash, Matt
> Cheah,
> Ilan Filonenko, Sean Suchter, Kimoon Kim
> Background and Motivation
> Containerization and cluster management technologies are constantly
> evolving in the cluster computing world. Apache Spark currently implements
> support for Apache Hadoop YARN and Apache Mesos, in addition to providing
> its own standalone cluster manager. In 2014, Google announced development
> of Kubernetes <> which has its own unique feature
> set and differentiates itself from YARN and Mesos. Since its debut, it has
> seen contributions from over 1300 contributors with over 50000 commits.
> Kubernetes has cemented itself as a core player in the cluster computing
> world, and cloud-computing providers such as Google Container Engine,
> Google Compute Engine, Amazon Web Services, and Microsoft Azure support
> running Kubernetes clusters.
> This document outlines a proposal for integrating Apache Spark with
> Kubernetes in a first class way, adding Kubernetes to the list of cluster
> managers that Spark can be used with. Doing so would allow users to share
> their computing resources and containerization framework between their
> existing applications on Kubernetes and their computational Spark
> applications. Although there is existing support for running a Spark
> standalone cluster on Kubernetes
> <>,
> there are still major advantages and significant interest in having native
> execution support. For example, this integration provides better support
> for multi-tenancy and dynamic resource allocation. It also allows users to
> run applications of different Spark versions of their choices in the same
> cluster.
> The feature is being developed in a separate fork
> <> in order to minimize risk
> to the main project during development. Since the start of the development
> in November of 2016, it has received over 100 commits from over 20
> contributors and supports two releases based on Spark 2.1 and 2.2
> respectively. Documentation is also being actively worked on both in the
> main project repository and also in the repository
> Regarding real-world use
> cases, we have seen cluster setup that uses 1000+ cores. We are also seeing
> growing interests on this project from more and more organizations.
> While it is easy to bootstrap the project in a forked repository, it is
> hard to maintain it in the long run because of the tricky process of
> rebasing onto the upstream and lack of awareness in the large Spark
> community. It would be beneficial to both the Spark and Kubernetes
> community seeing this feature being merged upstream. On one hand, it gives
> Spark users the option of running their Spark workloads along with other
> workloads that may already be running on Kubernetes, enabling better
> resource sharing and isolation, and better cluster administration. On the
> other hand, it gives Kubernetes a leap forward in the area of large-scale
> data processing by being an officially supported cluster manager for Spark.
> The risk of merging into upstream is low because most of the changes are
> purely incremental, i.e., new Kubernetes-aware implementations of existing
> interfaces/classes in Spark core are introduced. The development is also
> concentrated in a single place at resource-managers/kubernetes
> <>.
> The risk is further reduced by a comprehensive integration test framework,
> and an active and responsive community of future maintainers.
> Target Personas
> Devops, data scientists, data engineers, application developers, anyone
> who can benefit from having Kubernetes
> <> as a
> native cluster manager for Spark.
> Goals
>    -
>    Make Kubernetes a first-class cluster manager for Spark, alongside
>    Spark Standalone, Yarn, and Mesos.
>    -
>    Support both client and cluster deployment mode.
>    -
>    Support dynamic resource allocation
>    <>
>    .
>    -
>    Support Spark Java/Scala, PySpark, and Spark R applications.
>    -
>    Support secure HDFS access.
>    -
>    Allow running applications of different Spark versions in the same
>    cluster through the ability to specify the driver and executor Docker
>    images on a per-application basis.
>    -
>    Support specification and enforcement of limits on both CPU cores and
>    memory.
> Non-Goals
>    -
>    Support cluster resource scheduling and sharing beyond capabilities
>    offered natively by the Kubernetes per-namespace resource quota model.
> Proposed API Changes
> Most API changes are purely incremental, i.e., new Kubernetes-aware
> implementations of existing interfaces/classes in Spark core are
> introduced. Detailed changes are as follows.
>    -
>    A new cluster manager option KUBERNETES is introduced and some changes
>    are made to SparkSubmit to make it be aware of this option.
>    -
>    A new implementation of CoarseGrainedSchedulerBackend, namely
>    KubernetesClusterSchedulerBackend is responsible for managing the
>    creation and deletion of executor Pods through the Kubernetes API.
>    -
>    A new implementation of TaskSchedulerImpl, namely
>    KubernetesTaskSchedulerImpl, and a new implementation of TaskSetManager,
>    namely Kubernetes TaskSetManager, are introduced for Kubernetes-aware
>    task scheduling.
>    -
>    When dynamic resource allocation is enabled, a new implementation of
>    ExternalShuffleService, namely KubernetesExternalShuffleService is
>    introduced.
> Design Sketch
> Below we briefly describe the design. For more details on the design and
> architecture, please refer to the architecture documentation
> <>.
> The main idea of this design is to run Spark driver and executors inside
> Kubernetes Pods <>.
> Pods are a co-located and co-scheduled group of one or more containers run
> in a shared context. The driver is responsible for creating and destroying
> executor Pods through the Kubernetes API, while Kubernetes is fully
> responsible for scheduling the Pods to run on available nodes in the
> cluster. In the cluster mode, the driver also runs in a Pod in the cluster,
> created through the Kubernetes API by a Kubernetes-aware submission client
> called by the spark-submit script. Because the driver runs in a Pod, it
> is reachable by the executors in the cluster using its Pod IP. In the
> client mode, the driver runs outside the cluster and calls the Kubernetes
> API to create and destroy executor Pods. The driver must be routable from
> within the cluster for the executors to communicate with it.
> The main component running in the driver is the
> KubernetesClusterSchedulerBackend, an implementation of
> CoarseGrainedSchedulerBackend, which manages allocating and destroying
> executors via the Kubernetes API, as instructed by Spark core via calls to
> methods doRequestTotalExecutors and doKillExecutors, respectively. Within
> the KubernetesClusterSchedulerBackend, a separate kubernetes-pod-allocator
> thread handles the creation of new executor Pods with appropriate
> throttling and monitoring. Throttling is achieved using a feedback loop
> that makes decision on submitting new requests for executors based on
> whether previous executor Pod creation requests have completed. This
> indirection is necessary because the Kubernetes API server accepts requests
> for new Pods optimistically, with the anticipation of being able to
> eventually schedule them to run. However, it is undesirable to have a very
> large number of Pods that cannot be scheduled and stay pending within the
> cluster. The throttling mechanism gives us control over how fast an
> application scales up (which can be configured), and helps prevent Spark
> applications from DOS-ing the Kubernetes API server with too many Pod
> creation requests. The executor Pods simply run the
> CoarseGrainedExecutorBackend class from a pre-built Docker image that
> contains a Spark distribution.
> There are auxiliary and optional components: ResourceStagingServer and
> KubernetesExternalShuffleService, which serve specific purposes described
> below. The ResourceStagingServer serves as a file store (in the absence
> of a persistent storage layer in Kubernetes) for application dependencies
> uploaded from the submission client machine, which then get downloaded from
> the server by the init-containers in the driver and executor Pods. It is a
> Jetty server with JAX-RS and has two endpoints for uploading and
> downloading files, respectively. Security tokens are returned in the
> responses for file uploading and must be carried in the requests for
> downloading the files. The ResourceStagingServer is deployed as a
> Kubernetes Service
> <> backed
> by a Deployment
> <>
> in the cluster and multiple instances may be deployed in the same cluster.
> Spark applications specify which ResourceStagingServer instance to use
> through a configuration property.
> The KubernetesExternalShuffleService is used to support dynamic resource
> allocation, with which the number of executors of a Spark application can
> change at runtime based on the resource needs. It provides an additional
> endpoint for drivers that allows the shuffle service to delete driver
> termination and clean up the shuffle files associated with corresponding
> application. There are two ways of deploying the
> KubernetesExternalShuffleService: running a shuffle service Pod on each
> node in the cluster or a subset of the nodes using a DaemonSet
> <>,
> or running a shuffle service container in each of the executor Pods. In the
> first option, each shuffle service container mounts a hostPath
> <> volume.
> The same hostPath volume is also mounted by each of the executor
> containers, which must also have the environment variable SPARK_LOCAL_DIRS
> point to the hostPath. In the second option, a shuffle service container is
> co-located with an executor container in each of the executor Pods. The two
> containers share an emptyDir
> <> volume
> where the shuffle data gets written to. There may be multiple instances of
> the shuffle service deployed in a cluster that may be used for different
> versions of Spark, or for different priority levels with different resource
> quotas.
> New Kubernetes-specific configuration options are also introduced to
> facilitate specification and customization of driver and executor Pods and
> related Kubernetes resources. For example, driver and executor Pods can be
> created in a particular Kubernetes namespace and on a particular set of the
> nodes in the cluster. Users are allowed to apply labels and annotations to
> the driver and executor Pods.
> Additionally, secure HDFS support is being actively worked on following
> the design here
> <>.
> Both short-running jobs and long-running jobs that need periodic delegation
> token refresh are supported, leveraging built-in Kubernetes constructs like
> Secrets. Please refer to the design doc for details.
> Rejected DesignsResource Staging by the Driver
> A first implementation effectively included the ResourceStagingServer in
> the driver container itself. The driver container ran a custom command that
> opened an HTTP endpoint and waited for the submission client to send
> resources to it. The server would then run the driver code after it had
> received the resources from the submission client machine. The problem with
> this approach is that the submission client needs to deploy the driver in
> such a way that the driver itself would be reachable from outside of the
> cluster, but it is difficult for an automated framework which is not aware
> of the cluster's configuration to expose an arbitrary pod in a generic way.
> The Service-based design chosen allows a cluster administrator to expose
> the ResourceStagingServer in a manner that makes sense for their cluster,
> such as with an Ingress or with a NodePort service.
> Kubernetes External Shuffle Service
> Several alternatives were considered for the design of the shuffle
> service. The first design postulated the use of long-lived executor pods
> and sidecar containers in them running the shuffle service. The advantage
> of this model was that it would let us use emptyDir for sharing as opposed
> to using node local storage, which guarantees better lifecycle management
> of storage by Kubernetes. The apparent disadvantage was that it would be a
> departure from the traditional Spark methodology of keeping executors for
> only as long as required in dynamic allocation mode. It would additionally
> use up more resources than strictly necessary during the course of
> long-running jobs, partially losing the advantage of dynamic scaling.
> Another alternative considered was to use a separate shuffle service
> manager as a nameserver. This design has a few drawbacks. First, this means
> another component that needs authentication/authorization management and
> maintenance. Second, this separate component needs to be kept in sync with
> the Kubernetes cluster. Last but not least, most of functionality of this
> separate component can be performed by a combination of the in-cluster
> shuffle service and the Kubernetes API server.
> Pluggable Scheduler Backends
> Fully pluggable scheduler backends were considered as a more generalized
> solution, and remain interesting as a possible avenue for future-proofing
> against new scheduling targets.  For the purposes of this project, adding a
> new specialized scheduler backend for Kubernetes was chosen as the approach
> due to its very low impact on the core Spark code; making scheduler fully
> pluggable would be a high-impact high-risk modification to Spark’s core
> libraries. The pluggable scheduler backends effort is being tracked in
> JIRA-19700 <>.

View raw message