spark-user mailing list archives

From Everett Anderson <>
Subject Role-based S3 access outside of EMR
Date Tue, 19 Jul 2016 21:30:10 GMT

When running on EMR, AWS configures Hadoop to use its EMRFS Hadoop
FileSystem implementation for s3:// URLs, and it seems to install the
necessary S3 credential properties as well.

During development, though, it's often convenient to run outside of a
cluster, even with the "local" Spark master, and I've found that setup to be
more troublesome. I'm curious whether I'm doing it the right way.

There are two issues -- AWS credentials and finding the right combination
of compatible AWS SDK and Hadoop S3 FileSystem dependencies.

*Credentials and Hadoop Configuration*

For credentials, some guides recommend setting the AWS_SECRET_ACCESS_KEY
and AWS_ACCESS_KEY_ID environment variables or putting the corresponding
properties in the Hadoop XML config files, but it seems better practice to
rely on machine roles (EC2 instance profiles) and not expose these values.

What I end up doing, in code, when not running on EMR, is creating a
credentials provider and then using it to install the corresponding S3
properties in the Hadoop Configuration.
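The original code didn't survive in this archive, so here is a sketch of
the approach described, assuming aws-java-sdk and hadoop-aws on the
classpath (the helper name installS3Credentials is mine):

```scala
// Sketch only -- requires aws-java-sdk and hadoop-aws at runtime.
// DefaultAWSCredentialsProviderChain checks environment variables, Java
// system properties, and finally the EC2 instance profile (machine role).
import com.amazonaws.auth.DefaultAWSCredentialsProviderChain
import org.apache.hadoop.conf.Configuration

def installS3Credentials(hadoopConf: Configuration): Unit = {
  val creds = new DefaultAWSCredentialsProviderChain().getCredentials
  // Property names the S3A filesystem reads in Hadoop 2.7.x:
  hadoopConf.set("fs.s3a.access.key", creds.getAWSAccessKeyId)
  hadoopConf.set("fs.s3a.secret.key", creds.getAWSSecretKey)
}
```

Typical use would be installS3Credentials(sc.hadoopConfiguration) before
reading from S3. One caveat: instance-profile credentials are temporary
session credentials, while plain access/secret key properties are really
meant for long-lived IAM user keys, so this sketch may not cover every
role-based setup.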


I also set the fs.s3.impl and fs.s3n.impl properties to
org.apache.hadoop.fs.s3a.S3AFileSystem to force those schemes onto the S3A
implementation, since people usually use "s3://" URIs.
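That remapping is just two more Configuration properties; a minimal sketch,
assuming a SparkContext named sc:

```scala
// Outside EMR there is no EMRFS, so route s3:// and s3n:// URIs
// to the S3A implementation as well.
val hadoopConf = sc.hadoopConfiguration // sc: org.apache.spark.SparkContext
hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
hadoopConf.set("fs.s3n.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
```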

*SDK and File System Dependencies*

Some particular combination of the Hadoop version, AWS SDK version, and
hadoop-aws version is necessary.

One working S3A combination for me has been Spark 1.6.1 + Hadoop 2.7.x,
with hadoop-aws and AWS SDK versions that match that Hadoop release.
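The exact versions from the original message weren't preserved here; as an
illustration only, a commonly cited pairing for Hadoop 2.7.x is hadoop-aws
at the matching 2.7.x version plus aws-java-sdk 1.7.4, the SDK release that
hadoop-aws 2.7.x was built against. An sbt fragment under those assumptions:

```scala
// build.sbt sketch -- versions are illustrative, not from the original post.
// Mixing a newer AWS SDK with hadoop-aws 2.7.x commonly fails with
// NoSuchMethodError or NoClassDefFoundError at runtime.
libraryDependencies ++= Seq(
  "org.apache.spark"  %% "spark-core"   % "1.6.1" % "provided",
  "org.apache.hadoop" %  "hadoop-aws"   % "2.7.2",
  "com.amazonaws"     %  "aws-java-sdk" % "1.7.4"
)
```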


Is this generally what people do? Is there a better way?

I realize this isn't entirely a Spark-specific problem, but since so many
people use S3 with Spark, I imagine this community has faced it a lot.


- Everett
