spark-user mailing list archives

From: Andrew Ash <and...@andrewash.com>
Subject: Is spark-env.sh supposed to be stateless?
Date: Fri, 03 Jan 2014 06:33:44 GMT
In my spark-env.sh I append to the SPARK_CLASSPATH variable rather than
overriding it, because I want to support both adding a jar to every shell
instance (in spark-env.sh) and adding a jar to a single shell invocation
(SPARK_CLASSPATH=/path/to/my.jar /path/to/spark-shell).

That looks like this:

# spark-env.sh
export SPARK_CLASSPATH+=":/path/to/hadoop-lzo.jar"

However, when my Master and workers run, they end up with duplicate entries
from SPARK_CLASSPATH: there are 3 copies of hadoop-lzo on the classpath, 2 of
which are unnecessary.

The resulting command line in ps looks like this:

/path/to/java -cp :/path/to/hadoop-lzo.jar:/path/to/hadoop-lzo.jar:/path/to/hadoop-lzo.jar:[core spark jars]
  ... -Xms512m -Xmx512m org.apache.spark.deploy.worker.Worker spark://my-host:7077
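
(For reference, that output came from something like
"ps -ef | grep spark.deploy.worker.Worker" on the worker host; the exact ps
flags may vary by OS.)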

I tracked it down, and the problem is that spark-env.sh is sourced 3 times: in
spark-daemon.sh, in compute-classpath.sh, and in spark-class.  Each sourcing
appends to SPARK_CLASSPATH again, so the appended entries end up in triplicate.
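
One workaround I can sketch (not something the current scripts do, and the jar
path is the same placeholder as above) is to make the append in spark-env.sh
idempotent, so sourcing it repeatedly is harmless:

# spark-env.sh -- sketch: append hadoop-lzo.jar only if it isn't already there
LZO_JAR="/path/to/hadoop-lzo.jar"
case ":${SPARK_CLASSPATH}:" in
  *":${LZO_JAR}:"*) ;;   # already on the classpath, do nothing
  *) export SPARK_CLASSPATH="${SPARK_CLASSPATH}:${LZO_JAR}" ;;
esac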

Are all of those calls necessary?  Is it possible to edit the daemon
scripts so that spark-env.sh is sourced only once?
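
Alternatively, would a source-once guard make sense, either in spark-env.sh
itself or in the scripts that source it?  Roughly (SPARK_ENV_LOADED is just a
guard name I made up for this sketch):

# top of spark-env.sh -- sketch: skip the body if this file was already sourced
if [ -z "$SPARK_ENV_LOADED" ]; then
  export SPARK_ENV_LOADED=1
  export SPARK_CLASSPATH="${SPARK_CLASSPATH}:/path/to/hadoop-lzo.jar"
fi

Since the guard variable is exported, child scripts like spark-class and
compute-classpath.sh would see it and skip the append.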

FYI I'm starting the daemons with ./bin/start-master.sh and
./bin/start-slave.sh 1 $SPARK_URL

Thanks,
Andrew
