spark-dev mailing list archives

From Guru Medasani <gdm...@gmail.com>
Subject Consistent recommendation for submitting spark apps to YARN, -master yarn --deploy-mode x vs -master yarn-x'
Date Tue, 04 Aug 2015 03:20:05 GMT
Hi,

I was looking at the --help output of spark-submit and spark-shell on both Spark 1.3.1 and
a Spark 1.5 snapshot, and at the Spark documentation for submitting Spark applications to
YARN. There seems to be a mismatch between the syntax the help output shows and the syntax
the documentation describes.

Spark documentation <http://spark.apache.org/docs/latest/submitting-applications.html#master-urls>
says that we need to specify either yarn-cluster or yarn-client as the master to connect to a YARN cluster.



yarn-client     Connect to a YARN cluster <http://spark.apache.org/docs/latest/running-on-yarn.html>
                in client mode. The cluster location will be found based on the
                HADOOP_CONF_DIR or YARN_CONF_DIR variable.
yarn-cluster    Connect to a YARN cluster <http://spark.apache.org/docs/latest/running-on-yarn.html>
                in cluster mode. The cluster location will be found based on the
                HADOOP_CONF_DIR or YARN_CONF_DIR variable.

The spark-submit --help output, on the other hand, lists --master yarn together with
--deploy-mode cluster or --deploy-mode client:

Usage: spark-submit [options] <app jar | python file> [app arguments]
Usage: spark-submit --kill [submission ID] --master [spark://...]
Usage: spark-submit --status [submission ID] --master [spark://...]

Options:
  --master MASTER_URL         spark://host:port, mesos://host:port, yarn, or local.
  --deploy-mode DEPLOY_MODE   Whether to launch the driver program locally ("client") or
                              on one of the worker machines inside the cluster ("cluster")
                              (Default: client).

I want to bring this to your attention because it is confusing for someone running Spark
on YARN. For example, they read the spark-submit help output and start using that syntax,
but then see a different spark-submit syntax in the online documentation or on the user
mailing list.

From a quick discussion with other engineers at Cloudera, it seems that --deploy-mode is
preferred because it is more consistent with how the other cluster managers are handled:
there are no standalone-cluster or standalone-client masters, and the same applies to Mesos.

Either syntax works, but I would like to propose using ‘--master yarn --deploy-mode x’
instead of ‘--master yarn-cluster’ or ‘--master yarn-client’, as it is consistent with the
other cluster managers. This would require updating all Spark documentation pages related
to submitting Spark applications to YARN.
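
To make the equivalence between the two spellings concrete, here is a minimal shell sketch.
The function name master_args is hypothetical and not part of Spark; it only prints the
--master/--deploy-mode arguments that correspond to a legacy yarn-client or yarn-cluster
master value, and passes any other master value through unchanged.

```shell
# Hypothetical helper (not part of Spark): translate a legacy YARN master
# value into the equivalent --master / --deploy-mode arguments.
master_args() {
  case "$1" in
    yarn-client)  printf '%s\n' "--master yarn --deploy-mode client" ;;
    yarn-cluster) printf '%s\n' "--master yarn --deploy-mode cluster" ;;
    # Any other master value (e.g. local or a spark:// URL) is passed through.
    *)            printf '%s\n' "--master $1" ;;
  esac
}
```

For example, "master_args yarn-cluster" prints the arguments that the proposed syntax would
use in place of the old yarn-cluster master value.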

So far I’ve identified the following pages.

1) http://spark.apache.org/docs/latest/running-on-yarn.html
2) http://spark.apache.org/docs/latest/submitting-applications.html#master-urls

There is a JIRA to track the progress on this as well.

https://issues.apache.org/jira/browse/SPARK-9570
 
The option we choose dictates whether we update the documentation or the spark-submit and
spark-shell help output.

Any thoughts on which direction we should go?

Guru Medasani
gdmeda@gmail.com



