spark-user mailing list archives

From Andrew Or
Subject Re: Spark on YARN question
Date Tue, 02 Sep 2014 22:36:22 GMT
Hi Didata,

(1) Correct. The default deploy mode is `client`, so both masters `yarn`
and `yarn-client` run Spark in client mode. If you explicitly specify
master as `yarn-cluster`, Spark will run in cluster mode. If you implicitly
specify one deploy mode through the master (e.g. yarn-client) but set
deploy mode to the opposite (e.g. cluster), Spark will complain and throw
an exception. :)
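
To make the rule in (1) concrete, here is a small shell sketch of how the master and deploy mode combine. It is only an illustration of the behaviour described above, not Spark's actual code; the function name `resolve_deploy_mode` and the error text are made up for this example.

```shell
# Hypothetical helper, NOT part of Spark: models the rule described in (1).
# Usage: resolve_deploy_mode MASTER [DEPLOY_MODE]
resolve_deploy_mode() {
    master="$1"
    requested="${2:-}"   # value given to --deploy-mode, empty if not set

    # Deploy mode implied by the master string, if any.
    case "$master" in
        yarn-client)  implied="client" ;;
        yarn-cluster) implied="cluster" ;;
        *)            implied="" ;;
    esac

    # A master that implies one mode plus an explicit opposite mode is an error.
    if [ -n "$implied" ] && [ -n "$requested" ] && [ "$implied" != "$requested" ]; then
        echo "error: master '$master' conflicts with deploy mode '$requested'" >&2
        return 1
    fi

    # Otherwise: mode implied by the master, then the explicit flag, then the default.
    if [ -n "$implied" ]; then
        echo "$implied"
    elif [ -n "$requested" ]; then
        echo "$requested"
    else
        echo "client"
    fi
}
```

With this sketch, `resolve_deploy_mode yarn` and `resolve_deploy_mode yarn-client` both print `client`, while `resolve_deploy_mode yarn-client cluster` fails, which mirrors the exception mentioned above.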

(2) The jars passed through the `--jars` option only need to be visible to
the spark-submit program. Depending on the deploy mode, they will be
propagated to the containers (i.e. the executors, and the driver in cluster
mode) differently so you don't need to manually copy them yourself, either
through rsync'ing or uploading to HDFS. Another thing is that "SPARK_JAR"
is technically deprecated (you should get a warning for using it). Instead,
you can set "spark.yarn.jar" in your conf/spark-defaults.conf on the
submitter node.
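
As a rough sketch of (2), the property can be appended to the defaults file on the submitter node like this. The HDFS path is a placeholder in the spirit of the one you quoted, and the `conf_dir` variable is an assumption for this example; adjust both to your installation.

```shell
# Assumption: conf_dir is the Spark conf/ directory on the submitter node.
conf_dir="${conf_dir:-conf}"
mkdir -p "$conf_dir"

# Placeholder assembly location on HDFS; replace with your real path.
assembly_jar="hdfs://namenode:8020/path/to/spark-assembly.jar"

# spark-defaults.conf takes one "key value" pair per line, separated by whitespace.
printf '%s %s\n' "spark.yarn.jar" "$assembly_jar" >> "$conf_dir/spark-defaults.conf"

# Additional jars only need to exist locally where the command is run, e.g.
# (shown as a comment, not executed here):
#   bin/spark-shell --master yarn --jars /local/path/to/my/jar1,/another/jar2
```

After this, the jars named in `--jars` are read from the submitter's local filesystem and shipped to the containers for you.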

Let me know if you have more questions,

2014-09-02 15:12 GMT-07:00 Dimension Data, LLC.:

>  Hello friends:
> I have a follow-up to Andrew's well articulated answer below (thank you
> for that).
> (1) I've seen both of these invocations in various places:
>       (a) '--master yarn'
>       (b) '--master yarn-client'
>     the latter of which doesn't appear in 'pyspark|spark-submit|spark-shell
> --help' output.
>     Is case (a) meant for cluster-mode apps (where the driver is out on a
> YARN ApplicationMaster),
>     and case (b) for client-mode apps needing client interaction locally?
>     Also (related), is case (b) simply shorthand for the following
> invocation syntax?
>        '--master yarn --deploy-mode client'
> (2) Seeking clarification on the first sentence below...
>     Note: To avoid a copy of the Assembly JAR every time I launch a job,
> I place it (the latest version) at a specific (but otherwise arbitrary)
> location on HDFS, and then set SPARK_JAR, like so (where you can
> thankfully use wild-cards):
>        export SPARK_JAR=hdfs://namenode:8020/path/to/spark-assembly-*.jar
>     But my question here is, when specifying additional JARS like this
> '--jars /path/to/jar1,/path/to/jar2,...'
>     to *pyspark|spark-submit|spark-shell* commands, are those JARS
> expected to *already* be
>     at those path locations on both the _submitter_ server, as well as on
> YARN _worker_ servers?
>     In other words, the '--jars' option won't cause the command to look
> for them locally at those path
>     locations, and then ship & place them to the same path-locations
> remotely? They need to be there
>     already, both locally and remotely. Correct?
> Thank you. :)
> didata
>  On 09/02/2014 12:05 PM, Andrew Or wrote:
> Hi Greg,
>  You should not need to even manually install Spark on each of the worker
> nodes or put it into HDFS yourself. Spark on Yarn will ship all necessary
> jars (i.e. the assembly + additional jars) to each of the containers for
> you. You can specify additional jars that your application depends on
> through the --jars argument if you are using spark-submit / spark-shell /
> pyspark. As for environment variables, you can specify SPARK_YARN_USER_ENV
> on the driver node (where your application is submitted) to specify
> environment variables to be observed by your executors. If you are using
> the spark-submit / spark-shell / pyspark scripts, then you can set Spark
> properties in the conf/spark-defaults.conf properties file, and these will
> be propagated to the executors. In other words, configurations on the slave
> nodes don't do anything.
>  For example,
> $ vim conf/spark-defaults.conf   # set a few properties
> $ export SPARK_YARN_USER_ENV=YARN_LOCAL_DIR=/mnt,/mnt2
> $ bin/spark-shell --master yarn --jars /local/path/to/my/jar1,/another/jar2
>  Best,
> -Andrew
