From Eric Friedman
Subject specifying worker nodes when using the repl?
Date Mon, 19 May 2014 15:08:52 GMT

I am working with a Cloudera 5 cluster with 192 nodes and can’t work out how to get the
Spark REPL to use more than 2 nodes in an interactive session.

So, this works, but is non-interactive (using yarn-client as MASTER):

/opt/cloudera/parcels/CDH-5.0.0-1.cdh5.0.0.p0.47/lib/spark/bin/spark-class \
  org.apache.spark.deploy.yarn.Client \
  --jar /opt/cloudera/parcels/CDH-5.0.0-1.cdh5.0.0.p0.47/lib/spark/examples/lib/spark-examples_2.10-0.9.0-cdh5.0.0.jar \
  --class org.apache.spark.examples.SparkPi \
  --args yarn-standalone \
  --args 10 \
  --num-workers 100

There does not appear to be an obvious way to get more than 2 nodes involved from the REPL.

I am running the REPL like this:


. /etc/spark/conf.cloudera.spark/

export SPARK_JAR=hdfs://nameservice1/user/spark/share/lib/spark-assembly.jar


export MASTER=yarn-client

exec $SPARK_HOME/bin/spark-shell

Now, if I comment out the line with `export SPARK_JAR=…' and run this again, I get an error
like this:

14/05/19 08:03:41 ERROR Client: Error: You must set SPARK_JAR environment variable!
Usage: org.apache.spark.deploy.yarn.Client [options] 
  --jar JAR_PATH             Path to your application's JAR file (required in yarn-cluster
  --class CLASS_NAME         Name of your application's main class (required)
  --args ARGS                Arguments to be passed to your application's main class.
                             Mutliple invocations are possible, each will be passed in order.
  --num-workers NUM          Number of workers to start (Default: 2)

But none of those options, including --num-workers, are exposed at the `spark-shell' level.

Thanks in advance for your guidance.

