> You can do this manually via kubectl cp, so it should be possible to do it programmatically, since it appears to be just a tar stream piped into a kubectl exec. This would keep the relevant logic in the Kubernetes-specific client, which may or may not be desirable depending on whether we’re looking to fix this just for K8S or more generally. There is probably a fair bit of complexity in making this work, but does that sound like something worth exploring?

Yes, kubectl cp is able to copy files from your local machine into a container in a pod. However, the pod must be up and running for this to work, so if you want to use it to upload dependencies to the driver, the driver pod must already be up and running. At that point the driver may already need its dependencies, so you may not even have a chance to upload them in time.

On Mon, Oct 8, 2018 at 6:36 AM Rob Vesse <rvesse@dotnetrdf.org> wrote:

Folks, thanks for all the great input. Responding to various points raised:

 

Marcelo/Yinan/Felix –

 

Yes, client mode will work. The main JAR is automatically distributed, and --jars/--files dependencies are also distributed, though for --files user code needs to use the appropriate Spark API, i.e. SparkFiles.get(), to resolve the actual path.
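
For example, a minimal sketch of reading a --files dependency from user code; the file name here is purely illustrative:

    import org.apache.spark.SparkFiles

    // A file shipped with `--files /local/path/app.conf` is exposed under its
    // base name; SparkFiles.get resolves the node-local copy of it.
    // "app.conf" is a hypothetical name, for illustration only.
    val path = SparkFiles.get("app.conf")
    val source = scala.io.Source.fromFile(path)
    try println(source.mkString) finally source.close()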

 

However, client mode can be awkward if you want to mix spark-submit distribution with mounting dependencies via volumes, since you may need to ensure that dependencies appear at the same path both on the local submission client and when mounted into the executors. This mainly applies where user code does not use SparkFiles.get() and simply tries to access the path directly.

 

Marcelo/Stavros –

 

Yes, I did give the other resource managers too much credit. From my past experience with Mesos and Standalone I had thought this wasn’t an issue, but going back and looking at what we did for both of those, it appears we were entirely reliant on a shared file system (whether HDFS, NFS or other POSIX-compliant filesystems, e.g. Lustre).

 

Since connectivity back to the client is a potential stumbling block for cluster mode, I wonder if it would be better to think in reverse, i.e. rather than having the driver pull from the client, have the client push to the driver pod?

 

You can do this manually via kubectl cp, so it should be possible to do it programmatically, since it appears to be just a tar stream piped into a kubectl exec. This would keep the relevant logic in the Kubernetes-specific client, which may or may not be desirable depending on whether we’re looking to fix this just for K8S or more generally. There is probably a fair bit of complexity in making this work, but does that sound like something worth exploring?
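
To make the idea concrete, here is a rough sketch (not an implementation proposal) of how the push could be driven from the submission client via scala.sys.process; the pod name, namespace and destination directory are illustrative assumptions:

    import java.io.File
    import scala.sys.process._

    // Hypothetical sketch: push a local dependency into a running driver pod
    // by piping `tar c` into `kubectl exec -i ... tar x`, which is roughly
    // what `kubectl cp` does under the hood.
    def pushToDriverPod(localFile: File, namespace: String, podName: String,
                        destDir: String): Int = {
      val tarUp = Seq("tar", "cf", "-", "-C", localFile.getParent, localFile.getName)
      val untar = Seq("kubectl", "exec", "-i", "-n", namespace, podName, "--",
                      "tar", "xf", "-", "-C", destDir)
      (tarUp #| untar).!  // run the pipeline and return its exit code
    }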

 

I hadn’t really considered the HA aspect; a first step would be to get the basics working and then look at HA. Although, if the above theoretical approach is practical, that could simply be part of restarting the driver.

 

Rob

 

 

From: Felix Cheung <felixcheung_m@hotmail.com>
Date: Sunday, 7 October 2018 at 23:00
To: Yinan Li <liyinan926@gmail.com>, Stavros Kontopoulos <stavros.kontopoulos@lightbend.com>
Cc: Rob Vesse <rvesse@dotnetrdf.org>, dev <dev@spark.apache.org>
Subject: Re: [DISCUSS][K8S] Local dependencies with Kubernetes

 

Jars and libraries only accessible locally at the driver is fairly limited? Don’t you want the same on all executors?

 

 

 


From: Yinan Li <liyinan926@gmail.com>
Sent: Friday, October 5, 2018 11:25 AM
To: Stavros Kontopoulos
Cc: rvesse@dotnetrdf.org; dev
Subject: Re: [DISCUSS][K8S] Local dependencies with Kubernetes

 

> Just to be clear: in client mode things work right? (Although I'm not
> really familiar with how client mode works in k8s - never tried it.)

 

If the driver runs on the submission client machine, yes, it should just work. If the driver runs in a pod, however, it faces the same problem as in cluster mode.

 

Yinan

 

On Fri, Oct 5, 2018 at 11:06 AM Stavros Kontopoulos <stavros.kontopoulos@lightbend.com> wrote:

@Marcelo is correct. Mesos does not have anything similar. Only YARN does, thanks to its distributed cache.

I have described most of the above in the JIRA; there are also some other options there.

 

Best,

Stavros

 

On Fri, Oct 5, 2018 at 8:28 PM, Marcelo Vanzin <vanzin@cloudera.com.invalid> wrote:

On Fri, Oct 5, 2018 at 7:54 AM Rob Vesse <rvesse@dotnetrdf.org> wrote:
> Ideally this would all just be handled automatically for users in the way that all other resource managers do

I think you're giving other resource managers too much credit. In
cluster mode, only YARN really distributes local dependencies, because
YARN has that feature (its distributed cache) and Spark just uses it.

Standalone doesn't do it (see SPARK-4160) and I don't remember seeing
anything similar on the Mesos side.

There are things that could be done; e.g. if you have HDFS you could
do a restricted version of what YARN does (upload files to HDFS, and
change the "spark.jars" and "spark.files" URLs to point to HDFS
instead). Or you could turn the submission client into a file server
that the cluster-mode driver downloads files from - although that
requires connectivity from the driver back to the client.
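
For concreteness, a minimal sketch of the HDFS variant, assuming a
hypothetical uploadToHdfs helper (not an existing Spark API):

    import org.apache.spark.SparkConf

    // Hypothetical helper: copy a local file into an HDFS staging directory
    // and return the resulting hdfs:// URL. Not an existing Spark API.
    def uploadToHdfs(localUrl: String, stagingDir: String): String = ???

    // Rewrite spark.jars/spark.files so local entries point at the uploaded
    // copies, leaving already-remote URLs (hdfs://, http://, ...) untouched.
    def stageLocalDeps(conf: SparkConf, stagingDir: String): Unit = {
      for (key <- Seq("spark.jars", "spark.files"); urls <- conf.getOption(key)) {
        val rewritten = urls.split(",").map { url =>
          if (url.startsWith("file:") || !url.contains(":")) uploadToHdfs(url, stagingDir)
          else url
        }.mkString(",")
        conf.set(key, rewritten)
      }
    }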

Neither is great, but better than not having that feature.

Just to be clear: in client mode things work right? (Although I'm not
really familiar with how client mode works in k8s - never tried it.)

--
Marcelo
