spark-dev mailing list archives

From Cheng Lian <lian.cs....@gmail.com>
Subject Re: How to support dependency jars and files on HDFS in standalone cluster mode?
Date Thu, 11 Jun 2015 04:50:19 GMT
Since the jars are already on HDFS, you can access them directly in your
Spark application without using --jars.
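
For example, one way that should work is to add them from inside the
application itself; SparkContext.addJar (and addFile for the --files case)
accepts HDFS URIs directly:

    // the executors can fetch the hdfs:// URI themselves, so the jar
    // does not need to go through the driver's HTTP file server
    sc.addJar("hdfs://ip/1.jar")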

Cheng

On 6/11/15 11:04 AM, Dong Lei wrote:
>
> Hi spark-dev:
>
> I cannot use an HDFS location for the “--jars” or “--files” option 
> when doing a spark-submit in standalone cluster mode. For example:
>
>                 spark-submit  …  --jars hdfs://ip/1.jar  … 
>  hdfs://ip/app.jar (standalone cluster mode)
>
> will not download 1.jar to the driver’s HTTP file server (but app.jar 
> will be downloaded to the driver’s directory).
>
> I figured out that the reason Spark does not download the jars is that 
> when sc.addJar adds a jar to the HTTP file server, the function called 
> is Files.copy, which does not support a remote location.
>
> And I think even if Spark downloads the jars and adds them to the HTTP 
> file server, the classpath is still not set correctly, because the 
> classpath contains the remote locations.
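>
> Just to illustrate the point (a toy snippet, not taken from the Spark 
> code itself):
>
>     // a java.io.File built from an HDFS URI is not a remote handle,
>     // so a plain local Files.copy on it can only fail
>     val src = new java.io.File("hdfs://ip/1.jar")
>     println(src.exists())   // false -- java.io knows nothing about hdfs://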
>
> So I’m trying to make it work and have come up with two options, but 
> neither of them seems elegant, and I’d like to hear your advice:
>
> Option 1:
>
> Modify HTTPFileServer.addFileToDir so that it recognizes an “hdfs” prefix.
>
> This is not ideal because I think it goes beyond the scope of the HTTP file server.
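>
> For concreteness, the hdfs branch I have in mind is roughly this 
> (untested sketch; fetchToDir is just a made-up name):
>
>     import java.io.File
>     import java.net.URI
>     import org.apache.hadoop.conf.Configuration
>     import org.apache.hadoop.fs.{FileSystem, Path}
>
>     // untested sketch: copy from HDFS when the URI says so,
>     // otherwise keep the existing local copy behaviour
>     def fetchToDir(uri: String, dir: File): File = {
>       val dest = new File(dir, new Path(uri).getName)
>       if (uri.startsWith("hdfs://")) {
>         val fs = FileSystem.get(new URI(uri), new Configuration())
>         fs.copyToLocalFile(new Path(uri), new Path(dest.getAbsolutePath))
>       } else {
>         java.nio.file.Files.copy(new File(uri).toPath, dest.toPath)
>       }
>       dest
>     }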
>
> Option 2:
>
> Modify DriverRunner.downloadUserJar so that it downloads all the “--jars” 
> and “--files” along with the application jar.
>
> This sounds more reasonable than option 1 for downloading the files. But 
> this way I need to read “spark.jars” and “spark.files” in 
> downloadUserJar or DriverRunner.start and replace them with local 
> paths. How can I do that?
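>
> What I picture is roughly the following (untested sketch; localizeList 
> is a made-up name, and whether the SparkConf and work directory are 
> even available at that point in DriverRunner is exactly what I am 
> unsure about):
>
>     import java.io.File
>     import java.net.URI
>     import org.apache.hadoop.conf.Configuration
>     import org.apache.hadoop.fs.{FileSystem, Path}
>     import org.apache.spark.SparkConf
>
>     // untested sketch: pull every hdfs:// entry down next to the user jar
>     // and rewrite the comma-separated list to point at the local copies
>     def localizeList(conf: SparkConf, key: String, workDir: File): Unit = {
>       conf.getOption(key).foreach { list =>
>         val localized = list.split(",").filter(_.nonEmpty).map { uri =>
>           if (uri.startsWith("hdfs://")) {
>             val dest = new File(workDir, new Path(uri).getName)
>             val fs = FileSystem.get(new URI(uri), new Configuration())
>             fs.copyToLocalFile(new Path(uri), new Path(dest.getAbsolutePath))
>             dest.getAbsolutePath
>           } else uri
>         }
>         conf.set(key, localized.mkString(","))
>       }
>     }
>
>     // assuming `conf` and `workDir` are the driver's SparkConf and work
>     // directory at the point where the user jar has been downloaded:
>     localizeList(conf, "spark.jars", workDir)
>     localizeList(conf, "spark.files", workDir)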
>
> Do you have a more elegant solution, or do we have a plan to support 
> it in the future?
>
> Thanks
>
> Dong Lei
>

