spark-user mailing list archives

From "Dimension Data, LLC." <subscripti...@didata.us>
Subject Re: Spark on YARN question
Date Wed, 03 Sep 2014 00:28:12 GMT
Hi Andrew:

Ah okay, thank you for clarifying (1) and (2)... (even answering my
unwritten question about 'yarn-cluster', too). :)
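
For my own notes, then, and assuming I read (1) correctly, these two
client-mode invocations should be equivalent (where <app-and-args> is
just my placeholder for the application and its arguments):

   $ spark-submit --master yarn-client <app-and-args>
   $ spark-submit --master yarn --deploy-mode client <app-and-args>

and likewise for cluster mode:

   $ spark-submit --master yarn-cluster <app-and-args>
   $ spark-submit --master yarn --deploy-mode cluster <app-and-args>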

I will definitely use the 'spark.yarn.jar' property (and stop using 
SPARK_JAR). Thanks.
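
So, in place of the SPARK_JAR export I quoted below, my plan is
something like this in conf/spark-defaults.conf on the submitter node
(reusing my existing HDFS location; <version> is a placeholder, and I'll
point at a specific assembly rather than a wildcard since I don't know
whether the property expands globs the way the environment variable did):

   spark.yarn.jar   hdfs://namenode:8020/path/to/spark-assembly-<version>.jar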

Finally, this, from the --help output (my small addition being the
parenthetical '(i.e. copy to)')...

   --jars  Comma-separated list of local jars to include on
           (i.e. copy to) the driver and executor classpaths.

    I'm guessing that if proper permissions don't exist remotely for
    that 'copy', an exception will occur during the copy attempt? So
    care has to be taken there.
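
    If my assumption is right that those copies land in a staging
    directory under the submitting user's HDFS home, e.g.
    /user/<user>/.sparkStaging/<appId>, then a quick sanity check on my
    end before submitting would be something like:

       $ hdfs dfs -ls /user/$(whoami)
       $ hdfs dfs -mkdir -p /user/$(whoami)   # as the HDFS superuser, if it doesn't exist yet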


Thank you again! =:)



On 09/02/2014 06:36 PM, Andrew Or wrote:
> Hi Didata,
>
> (1) Correct. The default deploy mode is `client`, so both masters 
> `yarn` and `yarn-client` run Spark in client mode. If you explicitly 
> specify master as `yarn-cluster`, Spark will run in cluster mode. If 
> you implicitly specify one deploy mode through the master (e.g. 
> yarn-client) but set deploy mode to the opposite (e.g. cluster), Spark 
> will complain and throw an exception. :)
>
> (2) The jars passed through the `--jars` option only need to be 
> visible to the spark-submit program. Depending on the deploy mode, 
> they will be propagated to the containers (i.e. the executors, and the 
> driver in cluster mode) differently so you don't need to manually copy 
> them yourself, either through rsync'ing or uploading to HDFS. Another 
> thing is that "SPARK_JAR" is technically deprecated (you should get a 
> warning for using it). Instead, you can set "spark.yarn.jar" in your 
> conf/spark-defaults.conf on the submitter node.
>
> Let me know if you have more questions,
> -Andrew
>
>
> 2014-09-02 15:12 GMT-07:00 Dimension Data, LLC.
> <subscriptions@didata.us>:
>
>     Hello friends:
>
>     I have a follow-up to Andrew's well articulated answer below
>     (thank you for that).
>
>     (1) I've seen both of these invocations in various places:
>
>           (a) '--master yarn'
>           (b) '--master yarn-client'
>
>         the latter of which doesn't appear in
>     'pyspark|spark-submit|spark-shell --help' output.
>
>         Is case (a) meant for cluster-mode apps (where the driver is
>         out on a YARN ApplicationMaster), and case (b) for client-mode
>         apps needing client interaction locally?
>
>         Also (related), is case (b) simply shorthand for the following
>     invocation syntax?
>            '--master yarn --deploy-mode client'
>
>     (2) Seeking clarification on the first sentence below...
>
>         Note: To avoid a copy of the Assembly JAR every time I launch
>         a job, I place it (the latest version) at a specific (but
>         otherwise arbitrary) location on HDFS, and then set SPARK_JAR,
>         like so (where you can thankfully use wild-cards):
>
>            export SPARK_JAR=hdfs://namenode:8020/path/to/spark-assembly-*.jar
>
>         But my question here is, when specifying additional JARS like
>     this '--jars /path/to/jar1,/path/to/jar2,...'
>     to pyspark|spark-submit|spark-shell commands, are those JARS
>     expected to *already* be
>         at those path locations on both the _submitter_ server, as
>     well as on YARN _worker_ servers?
>
>         In other words, the '--jars' option won't cause the command to
>     look for them locally at those path
>         locations, and then ship & place them to the same
>     path-locations remotely? They need to be there
>         already, both locally and remotely. Correct?
>
>     Thank you. :)
>     didata
>
>
>     On 09/02/2014 12:05 PM, Andrew Or wrote:
>>     Hi Greg,
>>
>>     You should not need to even manually install Spark on each of the
>>     worker nodes or put it into HDFS yourself. Spark on Yarn will
>>     ship all necessary jars (i.e. the assembly + additional jars) to
>>     each of the containers for you. You can specify additional jars
>>     that your application depends on through the --jars argument if
>>     you are using spark-submit / spark-shell / pyspark. As for
>>     environment variables, you can specify SPARK_YARN_USER_ENV on the
>>     driver node (where your application is submitted) to specify
>>     environment variables to be observed by your executors. If you
>>     are using the spark-submit / spark-shell / pyspark scripts, then
>>     you can set Spark properties in the conf/spark-defaults.conf
>>     properties file, and these will be propagated to the executors.
>>     In other words, configurations on the slave nodes don't do anything.
>>
>>     For example,
>>     $ vim conf/spark-defaults.conf // set a few properties
>>     $ export SPARK_YARN_USER_ENV=YARN_LOCAL_DIR=/mnt,/mnt2
>>     $ bin/spark-shell --master yarn --jars
>>     /local/path/to/my/jar1,/another/jar2
>>
>>     Best,
>>     -Andrew
>
>

-- 
Dimension Data, LLC.
Sincerely yours,
Team Dimension Data
------------------------------------------------------------------------
Dimension Data, LLC. | https://www.didata.us
P: 212.882.1276 | subscriptions@didata.us
Follow Us: https://www.LinkedIn.com/company/didata

Dimension Data, LLC. | http://www.didata.us
Data Analytics you can literally count on.

