Hi,

We're currently using an EMR cluster (which uses YARN) but submitting Spark jobs to it using spark-submit from different machines outside the cluster. We haven't had time to investigate using something like Livy, yet.

We also have a need to use a mix of cluster and client modes in this configuration.

Three things we've struggled with here are
  1. Configuring spark-submit with the necessary master node host & ports
  2. Setting up the cluster to support file staging
  3. S3 implementation choices
I'm curious -- how do others handle these?

Here's what we're doing in case it helps anybody --

Configuring spark-submit 

As far as I can tell, you can't tell spark-submit the YARN resource manager info on the command-line with --conf properties. You must set a SPARK_CONF_DIR or HADOOP_CONF_DIR environment variable pointing to a local directory with core-site.xml, yarn-site.xml, and optionally hive-site.xml.

However, these setting files will override what's on the cluster, so you have to be careful and try to assemble just what you need (since you might use differently configured clusters). 

A starting point is to start a cluster and grab the files out of /etc/hadoop/conf and then whittle them down.

Setting up the cluster to support file staging

Out of the box, spark-submit will fail when trying to stage files because the cluster will try to put them in /user/(local user name on the machine the job was submitted from). That directory and user won't exist on the cluster. 

I think spark.yarn.stagingDir can change the directory, but you seem to need to setup your cluster with a bootstrap action to create and give fully open write permissions.

S3 implementation choices

Back in the "Role-based S3 access outside of EMR' thread, we talked about using S3A when running with the local master on an EC2 instance, which works in Hadoop 2.7+ with the right libraries.

AWS provides their own Hadoop FileSystem implementation for S3 called EMRFS, and the default EMR cluster setup uses it for "s3://" scheme URIs. As far as I know, they haven't released this library for use elsewhere. It supports "consistency view", which uses a DynamoDB to overcome any S3 list-key inconsistency/lag for I/O ops from the cluster. Presumably, also, they maintain it and its config, and keep them up to date and performing well.

If you use cluster mode and "s3://" scheme URIs, things work fine.

However, if you use client mode, it seems like Spark will try to use the Hadoop "s3://" scheme FileSystem on the submitting host for something, and it will fail because the default implementation won't know the credentials. One work-around is to set environment variables or Hadoop conf properties with your secret keys (!).

Another solution is to use the S3A implementation in Hadoop 2.7.x or later. However, if you use "s3a://" scheme URIs, they'll also be used on the cluster -- you'll use the S3A implementation for cluster operations instead of the EMRFS implementation. 

Similarly, if you change core-site.xml locally to use the S3A implementation for "s3://" scheme URIs, that will cause the cluster to also use the S3A implementation, when it could have used EMRFS.

Haven't figured out how to work around this, yet, or if it's important.