spark-user mailing list archives

From Everett Anderson <ever...@nuna.com.INVALID>
Subject Role-based S3 access outside of EMR
Date Tue, 19 Jul 2016 21:30:10 GMT
Hi,

When running on EMR, AWS configures Hadoop to use their EMRFS Hadoop
FileSystem implementation for s3:// URLs and appears to install the
necessary S3 credential properties as well.

During development, though, it's often convenient to run outside of a
cluster, even with the "local" Spark master, and I've found that setup to be
more troublesome. I'm curious whether I'm doing this the right way.

There are two issues: AWS credentials, and finding a compatible combination
of AWS SDK and Hadoop S3 FileSystem dependencies.

*Credentials and Hadoop Configuration*

For credentials, some guides recommend setting AWS_SECRET_ACCESS_KEY and
AWS_ACCESS_KEY_ID environment variables or putting the corresponding
properties in Hadoop XML config files, but it seems better practice to rely
on machine roles and not expose these.

What I end up doing, when not running on EMR, is creating a
DefaultAWSCredentialsProviderChain
<https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/auth/DefaultAWSCredentialsProviderChain.html>
in code and using it to install the following properties in the Hadoop
Configuration:

fs.s3.awsAccessKeyId
fs.s3n.awsAccessKeyId
fs.s3a.awsAccessKeyId
fs.s3.awsSecretAccessKey
fs.s3n.awsSecretAccessKey
fs.s3a.awsSecretAccessKey

I also set the fs.s3.impl and fs.s3n.impl properties to
org.apache.hadoop.fs.s3a.S3AFileSystem so that both schemes resolve to the
S3A implementation, since people usually write "s3://" URIs.
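The steps above could be sketched roughly like this (a minimal sketch, not
my exact code; it assumes aws-java-sdk and hadoop-common on the classpath,
and the class and method names are illustrative):

```java
import com.amazonaws.auth.AWSCredentials;
import com.amazonaws.auth.DefaultAWSCredentialsProviderChain;
import org.apache.hadoop.conf.Configuration;

public class S3CredentialsConfig {

    /** Install role-derived S3 credentials into a Hadoop Configuration. */
    public static void configure(Configuration conf) {
        // Resolve credentials via the SDK's default chain (environment
        // variables, Java system properties, profile file, instance role).
        AWSCredentials creds =
            new DefaultAWSCredentialsProviderChain().getCredentials();

        // Set the key/secret properties for all three S3 filesystem schemes.
        for (String scheme : new String[] {"s3", "s3n", "s3a"}) {
            conf.set("fs." + scheme + ".awsAccessKeyId",
                     creds.getAWSAccessKeyId());
            conf.set("fs." + scheme + ".awsSecretAccessKey",
                     creds.getAWSSecretKey());
        }

        // Route s3:// and s3n:// URIs through the S3A implementation.
        conf.set("fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem");
        conf.set("fs.s3n.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem");
    }
}
```

One caveat with this approach: the credentials are resolved once at
configuration time, so temporary role credentials will eventually expire in
a long-running local session.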

*SDK and File System Dependencies*

A specific combination
<https://issues.apache.org/jira/browse/HADOOP-12420> of Hadoop version,
AWS SDK version, and hadoop-aws version is necessary.

One S3A combination that seems to work for me with Spark 1.6.1 + Hadoop
2.7.x is

--packages
com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.2

Is this generally what people do? Is there a better way?

I realize this isn't entirely a Spark-specific problem, but since so many
people use S3 with Spark, I imagine this community has faced the problem a
lot.

Thanks!

- Everett
