From Rob Vesse <rve...@dotnetrdf.org>
Subject [DISCUSS][K8S] Local dependencies with Kubernetes
Date Fri, 05 Oct 2018 14:53:36 GMT
Folks

One of the big limitations of the current Spark on K8S implementation is that it isn't possible to use local dependencies (SPARK-23153 [1]), i.e. code, JARs, data etc. that live only on the submission client. This leaves end users with several options for how to actually run their Spark jobs under K8S:

 
1. Store local dependencies on some external distributed file system, e.g. HDFS (sketched below)
2. Build custom images containing their local dependencies
3. Mount local dependencies into volumes that are mounted by the K8S pods
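
For reference, Option 1 amounts to something like the following submission. This is only a minimal sketch: the API server endpoint, image name, class name and HDFS paths are all hypothetical, and it assumes the dependencies have already been uploaded to HDFS by the user:

    spark-submit \
      --master k8s://https://<api-server>:6443 \
      --deploy-mode cluster \
      --conf spark.kubernetes.container.image=<approved-image> \
      --jars hdfs://namenode:8020/deps/dep.jar \
      --class com.example.Main \
      hdfs://namenode:8020/deps/app.jar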
 

In all cases the onus is on the end user to do the prep work. Option 1 is unfortunately rare in the environments where we're looking to deploy Spark, and Option 2 tends to be a non-starter as many of our customers whitelist approved images, i.e. custom images are not permitted.

Option 3 is more workable, but it still requires users to provide a bunch of extra config options even for simple cases, or to rely upon the pending pod template feature for complex cases.
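
To illustrate how much configuration Option 3 needs even in a simple case, here is a rough sketch using hostPath volumes via the volume mount options on master. The volume name, image and paths are hypothetical, and it assumes the dependencies already exist on every node at the given host path:

    spark-submit \
      --master k8s://https://<api-server>:6443 \
      --deploy-mode cluster \
      --conf spark.kubernetes.container.image=<approved-image> \
      --conf spark.kubernetes.driver.volumes.hostPath.deps.mount.path=/deps \
      --conf spark.kubernetes.driver.volumes.hostPath.deps.options.path=/mnt/deps \
      --conf spark.kubernetes.executor.volumes.hostPath.deps.mount.path=/deps \
      --conf spark.kubernetes.executor.volumes.hostPath.deps.options.path=/mnt/deps \
      --class com.example.Main \
      local:///deps/app.jar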

 

Ideally this would all just be handled automatically for users, in the way that all the other resource managers handle it. The K8S backend even did this at one point in the downstream fork, but after a long discussion [2] that approach was dropped in favour of using standard Spark mechanisms, i.e. spark-submit. Unfortunately this apparently was never followed through upon, as it doesn't work with master as of today. Moreover, I am unclear how this would work for Spark on K8S in cluster mode, where the driver itself is inside a pod: the spark-submit mechanism is based upon copying files from the driver's filesystem to the executors via a file server running on the driver, so if the driver is inside a pod it won't be able to see local files on the submission client. I think this may work out of the box with client mode, but I haven't dug into that enough to verify yet.
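
To make the cluster vs client mode distinction concrete, a client mode submission might look like the sketch below (endpoint, image and paths again hypothetical). Here spark-submit and the driver both run on the submission client, so the driver's file server should be able to serve the local files to the executor pods, provided the executors can route back to the client:

    spark-submit \
      --master k8s://https://<api-server>:6443 \
      --deploy-mode client \
      --conf spark.kubernetes.container.image=<approved-image> \
      --jars /home/user/dep.jar \
      --class com.example.Main \
      /home/user/app.jar

In cluster mode the driver pod would instead resolve /home/user/dep.jar against its own filesystem, where the file doesn't exist.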

 

I would like to start work on addressing this problem, but to be honest I am unclear where to start. It seems that using the standard spark-submit mechanism is the way to go, but I'm not sure how to get around the driver pod issue. I would appreciate any pointers from folks who've looked at this previously on how and where to start.

Cheers,

Rob

[1] https://issues.apache.org/jira/browse/SPARK-23153

[2] https://lists.apache.org/thread.html/82b4ae9a2eb5ddeb3f7240ebf154f06f19b830f8b3120038e5d687a1@%3Cdev.spark.apache.org%3E

