There are several reasons why AWS users do not (or cannot) use EMR. For
us, one is security compliance: EMR is not open source, so we cannot use
it in our production systems. The second is that EMR does not support HA
yet.
But back to the original question from @Everett:
-> Credentials and Hadoop Configuration
As you said, the best practice is to rely on machine roles; AWS calls
them IAM roles.
We use the EMRFS implementation to access S3, and it supports IAM
role-based access control well. You can take a look here:
Or simply use our Docker image (Dockerfile on GitHub:
docker run -d --net=host \
-e START_MASTER="true" \
-e START_WORKER="true" \
-e START_WEBAPP="true" \
-e START_NOTEBOOK="true" \
-> SDK and File System Dependencies
As mentioned above, using the EMRFS libraries solved this problem:
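
For anyone who stays on the stock S3A connector instead of EMRFS, here is a minimal sketch in Python of the settings Everett describes: pointing s3:// and s3n:// URLs at S3AFileSystem, and setting static keys only when no machine role is available. The helper names (s3a_spark_conf, to_submit_args) are made up for illustration. It assumes that Spark copies spark.hadoop.* properties into the Hadoop Configuration and that your Hadoop version uses the fs.s3a.access.key / fs.s3a.secret.key property names, so check both against the versions you run.

```python
# Sketch only: builds Spark properties for S3A; it does not talk to AWS.
S3A_IMPL = "org.apache.hadoop.fs.s3a.S3AFileSystem"


def s3a_spark_conf(access_key=None, secret_key=None):
    """Return Spark properties that route s3:// and s3n:// URLs to S3A."""
    conf = {
        # Force both legacy URL schemes onto the S3A implementation.
        "spark.hadoop.fs.s3.impl": S3A_IMPL,
        "spark.hadoop.fs.s3n.impl": S3A_IMPL,
    }
    # Static keys are optional: leave them out on a machine that has a role
    # so that no secret ends up in configuration.
    if access_key and secret_key:
        conf["spark.hadoop.fs.s3a.access.key"] = access_key
        conf["spark.hadoop.fs.s3a.secret.key"] = secret_key
    return conf


def to_submit_args(conf):
    """Flatten the properties into spark-submit style --conf arguments."""
    args = []
    for key in sorted(conf):
        args += ["--conf", "{}={}".format(key, conf[key])]
    return args


if __name__ == "__main__":
    print(" ".join(to_submit_args(s3a_spark_conf())))
```

On an EC2 instance with an IAM role you would call s3a_spark_conf() with no arguments; as far as I know, S3A then falls back to the instance profile credentials on its own, but verify that for your Hadoop version.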
2016-07-21 8:37 GMT+02:00 Gourav Sengupta <firstname.lastname@example.org>:
> But that would mean you would be accessing data over the internet, increasing
> data read latency and data transmission failures. Why are you not using EMR?
> On Thu, Jul 21, 2016 at 1:06 AM, Everett Anderson <email@example.com>
>> Thanks, Andy.
>> I am indeed often doing something similar, now -- copying data locally
>> rather than dealing with the S3 impl selection and AWS credentials issues.
>> It'd be nice if it worked a little easier out of the box, though!
>> On Tue, Jul 19, 2016 at 2:47 PM, Andy Davidson
>> <Andy@santacruzintegration.com> wrote:
>>> Hi Everett
>>> I always do my initial data exploration and all our product development
>>> in my local dev env. I typically select a small data set and copy it to my
>>> local machine
>>> My main() has an optional command line argument '--runLocal'. Normally I
>>> load data from either hdfs:/// or s3n://. If the arg is set I read from
>>> Sometimes I use a CLI arg '--dataFileURL'.
>>> So in your case I would log into my data cluster and use "aws s3 cp" to
>>> copy the data into my cluster and then use "scp" to copy the data from the
>>> data center back to my local env.
>>> From: Everett Anderson <firstname.lastname@example.org.INVALID>
>>> Date: Tuesday, July 19, 2016 at 2:30 PM
>>> To: "user @spark" <email@example.com>
>>> Subject: Role-based S3 access outside of EMR
>>> When running on EMR, AWS configures Hadoop to use their EMRFS Hadoop
>>> FileSystem implementation for s3:// URLs and seems to install the necessary
>>> S3 credentials properties, as well.
>>> Often, it's nice during development to run outside of a cluster even with
>>> the "local" Spark master, though, which I've found to be more troublesome.
>>> I'm curious if I'm doing this the right way.
>>> There are two issues -- AWS credentials and finding the right combination
>>> of compatible AWS SDK and Hadoop S3 FileSystem dependencies.
>>> Credentials and Hadoop Configuration
>>> For credentials, some guides recommend setting AWS_SECRET_ACCESS_KEY and
>>> AWS_ACCESS_KEY_ID environment variables or putting the corresponding
>>> properties in Hadoop XML config files, but it seems better practice to rely
>>> on machine roles and not expose these.
>>> What I end up doing is, in code, when not running on EMR, creating a
>>> DefaultAWSCredentialsProviderChain and then installing the following
>>> properties in the Hadoop Configuration using it:
>>> I also set the fs.s3.impl and fs.s3n.impl properties to
>>> org.apache.hadoop.fs.s3a.S3AFileSystem to force them to use the S3A
>>> implementation since people usually use "s3://" URIs.
>>> SDK and File System Dependencies
>>> Some special combination of the Hadoop version, AWS SDK version, and
>>> hadoop-aws is necessary.
>>> One working S3A combination with Spark 1.6.1 + Hadoop 2.7.x for me seems
>>> to be with
>>> Is this generally what people do? Is there a better way?
>>> I realize this isn't entirely a Spark-specific problem, but as so many
>>> people seem to be using S3 with Spark, I imagine this community's faced the
>>> problem a lot.
>>> - Everett