spark-dev mailing list archives

From Alex Cozzi
Subject yarn, fat-jars and lib_managed
Date Thu, 09 Jan 2014 21:32:07 GMT
I am just starting to play with Spark on our Hadoop 2.2 cluster, and I have a question.

The current way to submit jobs to the cluster is to create fat-jars with sbt assembly. This
approach works, but I think it is less than optimal in many large Hadoop installations:

The way we interact with the cluster is to log into a CLI machine, which is the only machine
authorized to submit jobs. I cannot use the CLI machine as a dev environment because, for security
reasons, the CLI machine and the Hadoop cluster are firewalled and cannot reach the internet, so
sbt and Maven dependency resolution does not work there.

So the procedure now is:
- hack code
- sbt assembly
- rsync my spark directory to the CLI machine
- run my job.

The issue is that on every iteration I have to shuttle large binary files (all the fat-jars) back
and forth. They are about 120 MB now, so the transfer is slow, particularly when I am working
remotely from home.

I was wondering whether a better solution would be to create normal thin-jars of my code,
which are very small (less than 1 MB) and no problem to copy to the cluster every time,
and to take advantage of the sbt-created lib_managed directory to handle dependencies. We already
have this directory, which sbt maintains with all the dependencies the job needs to run.
Wouldn't it be possible to have the Spark YARN client take care of adding all the jars in
lib_managed to the classpath and distributing them to the workers automatically? They could also
be cached across invocations of Spark, since those jars are versioned and immutable, with the
possible exception of -SNAPSHOT releases. I think this would greatly simplify the development
procedure and remove the need to mess with ADD_JAR and SPARK_CLASSPATH.
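
To make the idea a bit more concrete, here is a rough sketch in Python of just the jar-collection
step. This is not how the Spark YARN client works today, and the function names (collect_jars,
classpath) are made up for illustration. It only assumes that the dependency jars sit somewhere
under lib_managed and that -SNAPSHOT jars can be recognized by their file name.

```python
import os


def collect_jars(lib_managed):
    """Walk lib_managed and return (stable, snapshot) lists of jar paths.

    Hypothetical helper: 'stable' jars are versioned and assumed immutable,
    so they are candidates for caching across runs; 'snapshot' jars may
    change between builds and would need to be re-shipped.
    """
    stable, snapshot = [], []
    for root, _dirs, files in os.walk(lib_managed):
        for name in files:
            if not name.endswith(".jar"):
                continue  # ignore anything that is not a jar
            path = os.path.join(root, name)
            if "-SNAPSHOT" in name:
                snapshot.append(path)
            else:
                stable.append(path)
    # Sort so the result does not depend on directory traversal order.
    return sorted(stable), sorted(snapshot)


def classpath(jars):
    """Join jar paths with the platform's classpath separator."""
    return os.pathsep.join(jars)
```

With something like this, the client could call collect_jars("lib_managed"), put
classpath(stable + snapshot) on the class path, and ship only the thin jar plus any jars that
are not already cached on the cluster.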

What do you think?
