spark-dev mailing list archives

From Cheng Lian <lian.cs....@gmail.com>
Subject Re: How to support dependency jars and files on HDFS in standalone cluster mode?
Date Fri, 12 Jun 2015 11:02:32 GMT
Would you mind filing a JIRA for this? Thanks!

Cheng

On 6/11/15 2:40 PM, Dong Lei wrote:
>
> I think in standalone cluster mode, Spark is supposed to:
>
> 1. Download the jars and files to the driver
>
> 2. Set the driver's classpath
>
> 3. Have the driver set up an HTTP file server to distribute these files
>
> 4. Have the workers download from the driver and set up their classpaths
>
> Right?
>
> But somehow, the first step fails.
>
> Even if I make the first step work (using option 1 below), it seems that 
> the classpath on the driver is still not set correctly.
>
> Thanks
>
> Dong Lei
>
> From: Cheng Lian [mailto:lian.cs.zju@gmail.com]
> Sent: Thursday, June 11, 2015 2:32 PM
> To: Dong Lei
> Cc: Dianfei (Keith) Han; dev@spark.apache.org
> Subject: Re: How to support dependency jars and files on HDFS in 
> standalone cluster mode?
>
> Oh sorry, I mistook --jars for --files. Yeah, for jars we need to add 
> them to the classpath, which is different from regular files.
>
> Cheng
>
> On 6/11/15 2:18 PM, Dong Lei wrote:
>
>     Thanks Cheng,
>
>     If I do not use --jars, how can I tell Spark to find the jars (and
>     files) on HDFS?
>
>     Do you mean the driver will not need to set up an HTTP file server
>     for this scenario, and the workers will fetch the jars and files
>     from HDFS?
>
>     Thanks
>
>     Dong Lei
>
>     From: Cheng Lian [mailto:lian.cs.zju@gmail.com]
>     Sent: Thursday, June 11, 2015 12:50 PM
>     To: Dong Lei; dev@spark.apache.org
>     Cc: Dianfei (Keith) Han
>     Subject: Re: How to support dependency jars and files on HDFS in
>     standalone cluster mode?
>
>     Since the jars are already on HDFS, you can access them directly
>     in your Spark application without using --jars.
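>
>     For example, something like this (just a sketch; the HDFS path is
>     the one from your example):
>
>         // The executors can fetch the jar straight from HDFS, so it
>         // does not need to go through the driver's HTTP file server.
>         sc.addJar("hdfs://ip/1.jar")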
>
>     Cheng
>
>     On 6/11/15 11:04 AM, Dong Lei wrote:
>
>         Hi spark-dev:
>
>         I cannot use an HDFS location for the "--jars" or "--files"
>         option when doing a spark-submit in standalone cluster mode.
>         For example:
>
>             spark-submit … --jars hdfs://ip/1.jar … hdfs://ip/app.jar (standalone cluster mode)
>
>         will not download 1.jar to the driver's HTTP file server (though
>         app.jar itself is downloaded to the driver's directory).
>
>         I figured out that the reason Spark does not download the jars
>         is that when sc.addJar adds them to the HTTP file server, the
>         function called is Files.copy, which does not support remote
>         locations.
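>
>         Roughly, from my reading of the current code (simplified, not
>         the exact source), the copy boils down to:
>
>             // HttpFileServer.addFileToDir, simplified: Guava's
>             // Files.copy only understands local java.io.File paths,
>             // so an "hdfs://..." location can never be copied here.
>             def addFileToDir(file: File, dir: File): String = {
>               Files.copy(file, new File(dir, file.getName))
>               dir + "/" + file.getName
>             }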
>
>         And I think even if Spark could download the jars and add them
>         to the HTTP file server, the classpath would still not be set
>         correctly, because it contains remote locations.
>
>         So I'm trying to make it work and have come up with two
>         options, but neither of them seems elegant, and I want to hear
>         your advice:
>
>         Option 1:
>
>         Modify HttpFileServer.addFileToDir so that it recognizes an
>         "hdfs://" prefix.
>
>         This is not ideal, because I think it goes beyond the scope of
>         an HTTP file server.
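>
>         A rough sketch of what I mean (assuming the method also sees
>         the original path string, and that a hadoopConf is available;
>         both are assumptions on my side):
>
>             // Hypothetical branch inside HttpFileServer.addFileToDir:
>             val uri = new URI(path)
>             if (uri.getScheme == "hdfs") {
>               // Localize the remote jar into the server's directory first.
>               val fs = FileSystem.get(uri, hadoopConf)
>               fs.copyToLocalFile(
>                 new Path(uri), new Path(dir.getAbsolutePath, new Path(uri).getName))
>             } else {
>               Files.copy(file, new File(dir, file.getName))
>             }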
>
>         Option 2:
>
>         Modify DriverRunner.downloadUserJar so that it downloads all
>         the "--jars" and "--files" along with the application jar.
>
>         This sounds more reasonable than option 1 for downloading
>         files. But this way I need to read "spark.jars" and
>         "spark.files" in downloadUserJar or DriverRunner.start and
>         replace them with local paths. How can I do that?
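>
>         Something like this is what I have in mind (only a sketch;
>         downloadToDriverDir is a hypothetical helper, and how to get at
>         the driver's conf here is exactly what I'm unsure about):
>
>             // Hypothetical addition to DriverRunner: localize remote
>             // dependencies, then point the properties at local copies.
>             val remoteJars = conf.get("spark.jars", "").split(",").filter(_.nonEmpty)
>             val localJars = remoteJars.map(downloadToDriverDir)
>             conf.set("spark.jars", localJars.mkString(","))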
>
>         Do you have a more elegant solution, or is there a plan to
>         support this in the future?
>
>         Thanks
>
>         Dong Lei
>

