But that would mean accessing the data over the internet, which increases read latency and the risk of transmission failures. Why are you not using EMR?


On Thu, Jul 21, 2016 at 1:06 AM, Everett Anderson <everett@nuna.com.invalid> wrote:
Thanks, Andy.

I am indeed often doing something similar now -- copying data locally rather than dealing with the S3 impl selection and AWS credentials issues. It'd be nice if it worked a little more easily out of the box, though!

On Tue, Jul 19, 2016 at 2:47 PM, Andy Davidson <Andy@santacruzintegration.com> wrote:
Hi Everett

I always do my initial data exploration and all our product development in my local dev env. I typically select a small data set and copy it to my local machine.

My main() has an optional command-line argument ‘--runLocal’. Normally I load data from either hdfs:/// or s3n://. If the arg is set, I read from file:///.

Sometimes I use a CLI arg ‘--dataFileURL’ instead.
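
The flag handling described above could look roughly like the following. This is only a sketch, in Python for brevity: the function name resolve_input_url and both default paths are hypothetical, and the choice to let --dataFileURL take precedence over --runLocal is an assumption, not something stated in the message.

```python
import argparse


def resolve_input_url(argv,
                      cluster_url="hdfs:///data/input",       # hypothetical default
                      local_url="file:///tmp/data/input"):    # hypothetical default
    """Pick the input location from the command-line arguments."""
    parser = argparse.ArgumentParser()
    # Optional switch: read a small local copy instead of cluster storage.
    parser.add_argument("--runLocal", action="store_true")
    # Optional explicit location; assumed here to override --runLocal.
    parser.add_argument("--dataFileURL", default=None)
    args = parser.parse_args(argv)

    if args.dataFileURL:
        return args.dataFileURL
    return local_url if args.runLocal else cluster_url


if __name__ == "__main__":
    import sys
    print(resolve_input_url(sys.argv[1:]))
```

The returned URL would then be passed to whatever load call the job already uses, so the rest of the job stays identical between local and cluster runs.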

So in your case I would log into my data cluster, use "aws s3 cp" to copy the data into my cluster, and then use "scp" to copy the data from the data center back to my local env.


From: Everett Anderson <everett@nuna.com.INVALID>
Date: Tuesday, July 19, 2016 at 2:30 PM
To: "user @spark" <user@spark.apache.org>
Subject: Role-based S3 access outside of EMR


When running on EMR, AWS configures Hadoop to use their EMRFS Hadoop FileSystem implementation for s3:// URLs and seems to install the necessary S3 credentials properties, as well.

During development, though, it's often nice to run outside of a cluster, even with just the "local" Spark master, and I've found that to be more troublesome. I'm curious if I'm doing this the right way.

There are two issues -- AWS credentials and finding the right combination of compatible AWS SDK and Hadoop S3 FileSystem dependencies.

Credentials and Hadoop Configuration

For credentials, some guides recommend setting AWS_SECRET_ACCESS_KEY and AWS_ACCESS_KEY_ID environment variables or putting the corresponding properties in Hadoop XML config files, but it seems better practice to rely on machine roles and not expose these. 

What I end up doing is, in code, when not running on EMR, creating a DefaultAWSCredentialsProviderChain and then installing the following properties in the Hadoop Configuration using it:


I also set the fs.s3.impl and fs.s3n.impl properties to org.apache.hadoop.fs.s3a.S3AFileSystem to force them to use the S3A implementation since people usually use "s3://" URIs.
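
As a rough illustration of the setup described above, here is a sketch in Python. The property list in the original message did not survive, so the two credential keys used here (fs.s3a.access.key and fs.s3a.secret.key, the S3A names in Hadoop 2.7.x) are an assumption rather than a record of what was actually set. The helper name s3a_hadoop_properties is hypothetical, and the key values would come from whatever credentials provider is in use.

```python
# S3A FileSystem class named in the message above.
S3A_IMPL = "org.apache.hadoop.fs.s3a.S3AFileSystem"


def s3a_hadoop_properties(access_key, secret_key):
    """Build the Hadoop Configuration entries as a plain dict.

    Assumption: credentials are passed through the fs.s3a.* keys.
    The fs.s3.impl / fs.s3n.impl entries force s3:// and s3n://
    URIs onto the S3A implementation.
    """
    if not access_key or not secret_key:
        raise ValueError("both access_key and secret_key are required")
    return {
        "fs.s3a.access.key": access_key,
        "fs.s3a.secret.key": secret_key,
        "fs.s3.impl": S3A_IMPL,
        "fs.s3n.impl": S3A_IMPL,
    }
```

In PySpark, one commonly seen way to apply such a dict is to loop over it and call sc._jsc.hadoopConfiguration().set(key, value) before reading any data; note that this goes through a private attribute of the SparkContext.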

SDK and File System Dependencies

A specific, mutually compatible combination of the Hadoop version, AWS SDK version, and hadoop-aws version is necessary.

One S3A combination that seems to work for me with Spark 1.6.1 + Hadoop 2.7.x is:

--packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.2

Is this generally what people do? Is there a better way?

I realize this isn't entirely a Spark-specific problem, but as so many people seem to be using S3 with Spark, I imagine this community's faced the problem a lot.


- Everett