spark-user mailing list archives

From "Dimension Data, LLC." <subscripti...@didata.us>
Subject Re: Spark on YARN question
Date Tue, 02 Sep 2014 22:12:06 GMT
Hello friends:

I have a follow-up to Andrew's well articulated answer below (thank you 
for that).

(1) I've seen both of these invocations in various places:

       (a) '--master yarn'
       (b) '--master yarn-client'

     the latter of which doesn't appear in
     'pyspark|spark-submit|spark-shell --help' output.

     Is case (a) meant for cluster-mode apps (where the driver runs inside
     a YARN ApplicationMaster), and case (b) for client-mode apps that need
     the driver to run locally for interaction?

     Also (related), is case (b) simply shorthand for the following 
invocation syntax?
        '--master yarn --deploy-mode client'
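
     For concreteness, here are the two invocations I'm comparing; the
     class name and application JAR are just placeholders:

        bin/spark-submit --master yarn-client --class my.Main /path/to/app.jar
        bin/spark-submit --master yarn --deploy-mode client --class my.Main /path/to/app.jar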

(2) Seeking clarification on the first sentence below...

     Note: To avoid a copy of the Assembly JAR every time I launch a
     job, I place it (the latest version) at a specific (but otherwise
     arbitrary) location on HDFS, and then set SPARK_JAR, like so
     (where you can thankfully use wild-cards):

        export SPARK_JAR=hdfs://namenode:8020/path/to/spark-assembly-*.jar
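
     (For context, I put the assembly there in the first place with
     something like the following; 'lib/spark-assembly-*.jar' is where it
     sits in my Spark distribution, and the HDFS path is the same
     placeholder as above:)

        hadoop fs -mkdir -p hdfs://namenode:8020/path/to
        hadoop fs -put lib/spark-assembly-*.jar hdfs://namenode:8020/path/to/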

     But my question here is, when specifying additional JARS like this 
'--jars /path/to/jar1,/path/to/jar2,...'
     to pyspark|spark-submit|spark-shell commands, are those JARS
     expected to *already* be at those path locations on both the
     _submitter_ server and the YARN _worker_ servers?

     In other words, the '--jars' option won't look for them locally at
     those paths and then ship and place them at the same paths on the
     remote nodes? They need to be there already, both locally and
     remotely. Correct?
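
     Here's the kind of invocation I have in mind; the class name, app JAR,
     and jar paths are just placeholders:

        export SPARK_JAR=hdfs://namenode:8020/path/to/spark-assembly-*.jar
        bin/spark-submit --master yarn --deploy-mode client \
            --jars /path/to/jar1,/path/to/jar2 \
            --class my.Main /path/to/app.jar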

Thank you. :)
didata


On 09/02/2014 12:05 PM, Andrew Or wrote:
> Hi Greg,
>
> You should not need to even manually install Spark on each of the 
> worker nodes or put it into HDFS yourself. Spark on Yarn will ship all 
> necessary jars (i.e. the assembly + additional jars) to each of the 
> containers for you. You can specify additional jars that your 
> application depends on through the --jars argument if you are using 
> spark-submit / spark-shell / pyspark. As for environment variables, 
> you can specify SPARK_YARN_USER_ENV on the driver node (where your 
> application is submitted) to specify environment variables to be 
> observed by your executors. If you are using the spark-submit / 
> spark-shell / pyspark scripts, then you can set Spark properties in 
> the conf/spark-defaults.conf properties file, and these will be 
> propagated to the executors. In other words, configurations on the 
> slave nodes don't do anything.
>
> For example,
> $ vim conf/spark-defaults.conf // set a few properties
> $ export SPARK_YARN_USER_ENV=YARN_LOCAL_DIR=/mnt,/mnt2
> $ bin/spark-shell --master yarn --jars 
> /local/path/to/my/jar1,/another/jar2
>
> Best,
> -Andrew
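
P.S. For completeness, here's a minimal conf/spark-defaults.conf along the
lines Andrew describes; the property names are from the Spark configuration
docs, and the values are only examples:

   $ cat conf/spark-defaults.conf
   spark.executor.memory    2g
   spark.serializer         org.apache.spark.serializer.KryoSerializer
   spark.eventLog.enabled   true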
