spark-dev mailing list archives

From Dong Lei <dong...@microsoft.com>
Subject How to support dependency jars and files on HDFS in standalone cluster mode?
Date Thu, 11 Jun 2015 03:04:41 GMT
Hi spark-dev:

I cannot use an HDFS location for the "--jars" or "--files" option when running spark-submit
in standalone cluster mode. For example:
                spark-submit  ...   --jars hdfs://ip/1.jar  ....  hdfs://ip/app.jar (standalone
cluster mode)
does not download 1.jar to the driver's HTTP file server (although app.jar is downloaded to
the driver's directory).

I found that Spark does not download the jars because sc.addJar, when adding a jar to the HTTP
file server, calls Files.copy, which does not support remote locations.
I also think that even if Spark could download the jars and add them to the HTTP file server,
the classpath would still not be set correctly, because it contains the remote location.
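
To illustrate the distinction, here is a minimal sketch of how a list of "--jars" entries could be split into paths that a plain local file copy can handle and URIs that need a remote filesystem client. It is written in Python for brevity and is not Spark code; the names is_local and partition_by_locality are hypothetical, and treating only an empty scheme or "file" as local is an assumption.

```python
from urllib.parse import urlparse

# Assumption: only scheme-less paths and file: URIs are readable
# with ordinary local-filesystem calls.
LOCAL_SCHEMES = {"", "file"}


def is_local(uri):
    """Return True if the URI can be read with a plain local file copy."""
    return urlparse(uri).scheme in LOCAL_SCHEMES


def partition_by_locality(uris):
    """Split URIs into (local, remote), preserving the input order."""
    local = [u for u in uris if is_local(u)]
    remote = [u for u in uris if not is_local(u)]
    return local, remote


if __name__ == "__main__":
    jars = ["hdfs://ip/1.jar", "/opt/libs/2.jar", "file:///opt/libs/3.jar"]
    local, remote = partition_by_locality(jars)
    print("local: ", local)   # entries a local copy can handle
    print("remote:", remote)  # entries that must be fetched first
```

Only the entries in the remote list would have to be fetched before anything like Files.copy could be applied to them.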

I am trying to make this work and have come up with two options, but neither of them seems
elegant, so I would like to hear your advice:

Option 1:
Modify HTTPFileServer.addFileToDir so that it recognizes the "hdfs" prefix.

This is not good because I think it goes beyond the scope of the HTTP file server.

Option 2:
Modify DriverRunner.downloadUserJar so that it downloads all of the "--jars" and "--files"
along with the application jar.

This sounds more reasonable than option 1 for downloading files. But this way I need to read
"spark.jars" and "spark.files" in downloadUserJar or DriverRunner.start and replace them
with local paths. How can I do that?
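
As a rough sketch of the rewriting step in option 2, the snippet below takes a comma-separated value such as the one stored in "spark.jars", replaces every remote URI with a path under a download directory, and returns the list of downloads that would still have to be performed. It is Python for brevity and not Spark code: the function name localize_paths and the directory /work/driver-0 are hypothetical, and the actual fetch from HDFS is deliberately left out.

```python
import posixpath
from urllib.parse import urlparse


def localize_paths(conf_value, download_dir):
    """Rewrite a comma-separated URI list so remote entries point to local files.

    Returns (rewritten_value, downloads), where downloads is a list of
    (remote_uri, local_path) pairs that the caller still has to fetch.
    Assumption: entries with no scheme or a file: scheme are already local.
    Entries sharing a file name would collide; this sketch does not handle that.
    """
    rewritten, downloads = [], []
    for uri in (u.strip() for u in conf_value.split(",")):
        if not uri:
            continue  # skip empty entries such as a trailing comma
        parsed = urlparse(uri)
        if parsed.scheme in ("", "file"):
            rewritten.append(uri)
            continue
        local_path = posixpath.join(download_dir, posixpath.basename(parsed.path))
        downloads.append((uri, local_path))
        rewritten.append(local_path)
    return ",".join(rewritten), downloads


if __name__ == "__main__":
    value, todo = localize_paths("hdfs://ip/1.jar,/opt/libs/2.jar", "/work/driver-0")
    print(value)  # /work/driver-0/1.jar,/opt/libs/2.jar
    print(todo)   # [('hdfs://ip/1.jar', '/work/driver-0/1.jar')]
```

The rewritten value could then be written back to the configuration before the driver is launched, so that the classpath refers only to local files.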


Do you have a more elegant solution, or is there a plan to support this in the future?

Thanks
Dong Lei
