spark-dev mailing list archives

From "Lalwani, Jayesh" <>
Subject Re: Toward an "API" for spark images used by the Kubernetes back-end
Date Thu, 22 Mar 2018 17:18:39 GMT
I would like to add that many people run Spark behind corporate proxies. It’s very common
to add an HTTP proxy to extraJavaOptions. Being able to provide custom extraJavaOptions should
be supported.
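
To illustrate the proxy case, here is a minimal sketch that builds the JVM proxy flags and passes them through `spark.driver.extraJavaOptions` and `spark.executor.extraJavaOptions`. The proxy host, port, and non-proxy host list are placeholder values, and the helper function names are hypothetical; only the Spark property names and the standard JVM `http.*`/`https.*` system properties are real.

```python
def proxy_java_options(host, port, non_proxy_hosts=None):
    """Build a JVM option string that routes HTTP(S) traffic through a proxy."""
    opts = [
        f"-Dhttp.proxyHost={host}",
        f"-Dhttp.proxyPort={port}",
        f"-Dhttps.proxyHost={host}",
        f"-Dhttps.proxyPort={port}",
    ]
    if non_proxy_hosts:
        # The JVM expects a '|'-separated list for http.nonProxyHosts.
        opts.append("-Dhttp.nonProxyHosts=" + "|".join(non_proxy_hosts))
    return " ".join(opts)


def spark_submit_conf_args(java_options):
    """Return --conf arguments applying the same JVM options to driver and executors."""
    return [
        "--conf", f"spark.driver.extraJavaOptions={java_options}",
        "--conf", f"spark.executor.extraJavaOptions={java_options}",
    ]


if __name__ == "__main__":
    # proxy.example.com, 8080 and the non-proxy hosts are placeholder values.
    options = proxy_java_options("proxy.example.com", 8080, ["localhost", "*.svc"])
    print(spark_submit_conf_args(options))
```

A stock image would only need to leave these two properties user-settable for this to work without a custom container.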

Also, Hadoop FS 2.7.3 is pretty limited wrt S3 buckets. You cannot use temporary AWS tokens.
You cannot assume roles. You cannot use KMS buckets. All of this comes out of the box on EMR
because EMR is built with its own customized Hadoop FS. For standalone installations, it’s
pretty common to “customize” your Spark installation using Hadoop 2.8.3 or higher. I don’t
know if a Spark container with Hadoop 2.8.3 will be a standard container. If it isn’t, I
see a lot of people creating a customized container with Hadoop FS 2.8.3.
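
As a sketch of the temporary-token case only, the following maps a set of session credentials onto Spark configuration. It assumes an image whose Hadoop S3A connector is 2.8 or later, and it relies on Spark's `spark.hadoop.*` passthrough to the Hadoop configuration. The function name and the credential values are hypothetical.

```python
def s3a_temporary_credentials_conf(access_key, secret_key, session_token):
    """Map temporary AWS credentials onto Spark's spark.hadoop.* passthrough keys."""
    return {
        # Tells S3A to expect an access key, a secret key and a session token.
        "spark.hadoop.fs.s3a.aws.credentials.provider":
            "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider",
        "spark.hadoop.fs.s3a.access.key": access_key,
        "spark.hadoop.fs.s3a.secret.key": secret_key,
        "spark.hadoop.fs.s3a.session.token": session_token,
    }


if __name__ == "__main__":
    # Placeholder credentials; real values would come from a token service.
    conf = s3a_temporary_credentials_conf("AKIA-EXAMPLE", "secret-example", "token-example")
    for key, value in sorted(conf.items()):
        print(f"--conf {key}={value}")
```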

From: Rob Vesse <>
Date: Thursday, March 22, 2018 at 6:11 AM
To: "" <>
Subject: Re: Toward an "API" for spark images used by the Kubernetes back-end

The difficulty with a custom Spark config is that you need to be careful that the Spark config
the user provides does not conflict with the auto-generated portions of the Spark config necessary
to make Spark on K8S work.  So part of any “API” definition might need to be what Spark
config is considered “managed” by the Kubernetes scheduler backend.
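
One way to picture the “managed” idea is a check that separates user-supplied settings from keys the back-end reserves for itself. This is only a sketch: which keys are actually managed is exactly what the “API” would have to define, so the prefixes below are assumptions chosen for illustration, and the function name is hypothetical.

```python
# Assumption: illustrative prefixes only. The real managed set would have to
# be published by the Kubernetes scheduler back-end.
MANAGED_PREFIXES = ("spark.kubernetes.driver.pod.", "spark.driver.host")


def split_user_conf(user_conf, managed_prefixes=MANAGED_PREFIXES):
    """Split user config into (allowed, rejected) based on managed key prefixes."""
    allowed, rejected = {}, {}
    for key, value in user_conf.items():
        # str.startswith accepts a tuple and matches any of its entries.
        if key.startswith(tuple(managed_prefixes)):
            rejected[key] = value
        else:
            allowed[key] = value
    return allowed, rejected


if __name__ == "__main__":
    user_conf = {
        "spark.executor.memory": "4g",
        "spark.driver.host": "my-host",  # would collide with a managed key
    }
    allowed, rejected = split_user_conf(user_conf)
    print("allowed:", allowed)
    print("rejected:", rejected)
```

Rejecting (or at least warning about) the second group up front would be one way to surface a conflict before the pod is launched.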

For more controlled environments - i.e. security conscious - allowing end users to provide
custom images may be a non-starter so the more we can do at the “API” level without customising
the containers the better.  A practical example of this is managing Python dependencies: one
option we’re considering is having a base image with Anaconda included and then simply projecting
a Conda environment spec into the containers (via volume mounts) and then having the container
recreate that Conda environment on startup.  That won’t work for all possible environments
e.g. those that use non-standard Conda channels but it would provide a lot of capability without
customising the images.
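
A rough sketch of that startup step, assuming the spec is mounted as a file and the base image has `conda` on its PATH. The mount path, environment name, and function names are hypothetical; `conda env create --file ... --name ...` is the standard command for building an environment from a spec.

```python
import os
import subprocess


def conda_create_command(spec_path, env_name):
    """Return the conda command that recreates an environment from a spec file."""
    return ["conda", "env", "create", "--file", spec_path, "--name", env_name]


def recreate_env_if_present(spec_path, env_name, runner=subprocess.check_call):
    """Recreate the environment only when a spec has been projected into the container."""
    if not os.path.isfile(spec_path):
        return False  # no spec mounted, keep the base environment
    runner(conda_create_command(spec_path, env_name))
    return True


if __name__ == "__main__":
    # Hypothetical mount point for the projected spec.
    created = recreate_env_if_present("/opt/spark/conda/environment.yml", "job-env")
    print("environment recreated" if created else "no spec found, using base image")
```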


From: Felix Cheung <>
Date: Thursday, 22 March 2018 at 06:21
To: Holden Karau <>, Erik Erlandson <>
Cc: dev <>
Subject: Re: Toward an "API" for spark images used by the Kubernetes back-end

I like being able to customize the docker image itself - but I realize this thread is more
about “API” for the stock image.

Environment is nice. Probably we need a way to set custom spark config (as a file??)

From: Holden Karau <>
Sent: Wednesday, March 21, 2018 10:44:20 PM
To: Erik Erlandson
Cc: dev
Subject: Re: Toward an "API" for spark images used by the Kubernetes back-end

I’m glad this discussion is happening on dev@ :)

Personally I like customizing with shell env variables when rolling my own image, but
documenting the expectations/usage of the variables is definitely needed before we can really
call it an API.

On the related question, I suspect two of the more “common” likely customizations are adding
additional jars for bootstrapping fetching from a DFS, and similarly complicated Python
dependencies (although given that Python support isn’t merged yet, it’s hard to say what
exactly this would look like).

I could also see some vendors wanting to add some bootstrap/setup scripts to fetch keys or
other things.

What other ways do folks foresee customizing their Spark docker containers?

On Wed, Mar 21, 2018 at 5:04 PM Erik Erlandson <> wrote:
During the review of the recent PR to remove use of the init_container from kube pods as created
by the Kubernetes back-end, the topic of documenting the "API" for these container images
also came up. What information does the back-end provide to these containers? In what form?
What assumptions does the back-end make about the structure of these containers?  This information
is important in a scenario where a user wants to create custom images, particularly if these
are not based on the reference dockerfiles.

A related topic is deciding what such an API should look like.  For example, early incarnations
were based more purely on environment variables, which could have advantages in terms of an
API that is easy to describe in a document.  If we document the current API, should we annotate
it as Experimental?  If not, does that effectively freeze the API?
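
To make the environment-variable option concrete, here is a sketch of a contract that doubles as documentation and as a startup check. Every variable name, default, and description below is hypothetical and is not taken from the reference dockerfiles.

```python
import os

# Hypothetical contract: names, defaults and descriptions are illustrative only.
IMAGE_ENV_CONTRACT = {
    "EXAMPLE_SPARK_ROLE": {"required": True, "doc": "Either 'driver' or 'executor'."},
    "EXAMPLE_EXTRA_CLASSPATH": {"required": False, "default": "", "doc": "Extra jars."},
}


def read_contract(environ, contract=IMAGE_ENV_CONTRACT):
    """Validate required variables and fill in defaults for optional ones."""
    missing = [name for name, spec in contract.items()
               if spec["required"] and name not in environ]
    if missing:
        raise KeyError("missing required variables: " + ", ".join(sorted(missing)))
    return {name: environ.get(name, spec.get("default"))
            for name, spec in contract.items()}


def describe_contract(contract=IMAGE_ENV_CONTRACT):
    """Render the contract as plain-text documentation lines."""
    return [f"{name} ({'required' if spec['required'] else 'optional'}): {spec['doc']}"
            for name, spec in sorted(contract.items())]


if __name__ == "__main__":
    print("\n".join(describe_contract()))
    try:
        print(read_contract(os.environ))
    except KeyError as err:
        print("contract violation:", err)
```

Keeping the description next to each variable is one way to keep the documented API and the checked API from drifting apart.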

We are interested in community input about possible customization use cases and opinions on
possible API designs!
