spark-user mailing list archives

From Steve Loughran <ste...@hortonworks.com>
Subject Re: Role-based S3 access outside of EMR
Date Sat, 23 Jul 2016 11:11:39 GMT


Amazon's S3 client on EMR (EMRFS) has stronger consistency guarantees than the ASF S3 clients;
it uses DynamoDB to do this.

There is some work underway to do something similar atop S3A, called S3Guard; see
https://issues.apache.org/jira/browse/HADOOP-13345.

Regarding IAM support in Spark: the latest version of S3A, which will ship in Hadoop 2.8,
adds IAM support, temporary credentials, direct environment variable pickup, and the ability
to add your own.

Regarding getting the relevant binaries into your app, you need a version of the hadoop-aws
library consistent with the rest of Hadoop, and the version of the Amazon AWS SDK that Hadoop
was built against. APIs in the SDK have changed, and attempting to upgrade the Amazon JAR will
fail.
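
One way to keep the two versions in step is to derive the --packages value from the Hadoop
version rather than typing it by hand. A minimal Python sketch, not an official tool: the
helper name is hypothetical, and the lookup table is an assumption that holds only the one
pairing reported as working in this thread (hadoop-aws 2.7.2 with aws-java-sdk 1.7.4).

```python
# Hypothetical helper: builds a spark-submit --packages value that pins
# hadoop-aws to the Hadoop version and the AWS SDK to the version paired with it.

# Assumption: only the pairing reported as working in this thread is listed.
# Check the hadoop-aws POM for any other Hadoop version before adding it.
AWS_SDK_FOR_HADOOP = {
    "2.7.2": "1.7.4",
}


def s3a_packages(hadoop_version):
    """Return Maven coordinates for hadoop-aws and its matching AWS SDK."""
    try:
        sdk_version = AWS_SDK_FOR_HADOOP[hadoop_version]
    except KeyError:
        # Refuse to guess: a mismatched SDK is exactly the failure to avoid.
        raise ValueError(
            "no known AWS SDK pairing for Hadoop %s" % hadoop_version
        )
    return ",".join([
        "com.amazonaws:aws-java-sdk:%s" % sdk_version,
        "org.apache.hadoop:hadoop-aws:%s" % hadoop_version,
    ])


if __name__ == "__main__":
    print("--packages " + s3a_packages("2.7.2"))
```

Failing on an unknown Hadoop version is deliberate here; silently falling back to a newer SDK
would reintroduce the mismatch described above.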

There's a PR attached to SPARK-7481 which does the bundling and adds a suite of tests; it's
designed to work with Hadoop 2.7+ builds. If you are building Spark locally, please try it
and provide feedback on the PR.

Finally, don't try to use S3A on Hadoop 2.6: it was really in a preview state there, and it
let bugs surface which were fixed in 2.7.

-Steve

ps: More on S3a in Hadoop 2.8. Things will be way better! http://slideshare.net/HadoopSummit/hadoop-cloud-storage-object-store-integration-in-production


On 21 Jul 2016, at 17:23, Ewan Leith <ewan.leith@realitymine.com> wrote:

If you use S3A rather than S3N, it supports IAM roles.

I think you can have S3A handle s3:// style URLs, so they're consistent with your EMR paths,
by adding this to your Hadoop config, probably in core-site.xml:

fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
fs.s3.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
fs.AbstractFileSystem.s3.impl=org.apache.hadoop.fs.s3a.S3A
fs.AbstractFileSystem.s3a.impl=org.apache.hadoop.fs.s3a.S3A

And make sure the S3A jars are on your classpath.
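
The four properties above are written as key=value pairs, while core-site.xml expects
<property> elements. A minimal Python sketch of the conversion, using only the standard
library: the function names are hypothetical, the property values are copied unchanged from
the list above, and the spark.hadoop. prefix is Spark's convention for passing Hadoop
settings through its own configuration.

```python
import xml.etree.ElementTree as ET

# The four properties from the message above, copied unchanged.
S3A_PROPERTIES = {
    "fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
    "fs.s3.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
    "fs.AbstractFileSystem.s3.impl": "org.apache.hadoop.fs.s3a.S3A",
    "fs.AbstractFileSystem.s3a.impl": "org.apache.hadoop.fs.s3a.S3A",
}


def to_core_site_xml(properties):
    """Render a dict of Hadoop properties as a <configuration> document."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


def to_spark_conf(properties):
    """Render the same settings as spark-submit --conf values."""
    return ["spark.hadoop.%s=%s" % (k, v) for k, v in properties.items()]


if __name__ == "__main__":
    print(to_core_site_xml(S3A_PROPERTIES))
    for setting in to_spark_conf(S3A_PROPERTIES):
        print("--conf " + setting)
```

The second form is handy when you cannot edit core-site.xml on a sandbox machine and would
rather pass everything on the spark-submit command line.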

Thanks,
Ewan

From: Everett Anderson [mailto:everett@nuna.com.INVALID]
Sent: 21 July 2016 17:01
To: Gourav Sengupta <gourav.sengupta@gmail.com>
Cc: Teng Qiu <tengqiu@gmail.com>; Andy Davidson <Andy@santacruzintegration.com>;
user <user@spark.apache.org>
Subject: Re: Role-based S3 access outside of EMR

Hey,

FWIW, we are using EMR, actually, in production.

The main case I have for wanting to access S3 with Spark outside of EMR is that during development,
our developers tend to run EC2 sandbox instances that have all the rest of our code and access
to some of the input data on S3. It'd be nice if S3 access "just worked" on these without
storing the access keys in an exposed manner.

Teng -- when you say you use EMRFS, does that mean you copied AWS's EMRFS JAR from an EMR
cluster and are using it outside? My impression is that AWS hasn't released the EMRFS implementation
as part of the aws-java-sdk, so I'm wary of using it. Do you know if it's supported?


On Thu, Jul 21, 2016 at 2:32 AM, Gourav Sengupta <gourav.sengupta@gmail.com> wrote:
Hi Teng,
This is totally news to me, that people cannot use EMR in production because it's not open
source; I think even Werner is not aware of such a problem. Is EMRFS open source?
Also, I am curious to know what HA stands for.
Regards,
Gourav

On Thu, Jul 21, 2016 at 8:37 AM, Teng Qiu <tengqiu@gmail.com> wrote:
There are several reasons why AWS users do not (or cannot) use EMR. For
us, one is security compliance: EMR is not open source at all, so we
cannot use it in production systems. A second is that EMR does not
support HA yet.

but to the original question from @Everett :

-> Credentials and Hadoop Configuration

As you said, the best practice is to "rely on machine roles"; AWS
calls them IAM roles.

We are using the EMRFS implementation to access S3; it supports IAM
role-based access control well. You can take a look here:
https://github.com/zalando/spark/tree/branch-1.6-zalando

or simply use our docker image (Dockerfile on github:
https://github.com/zalando/spark-appliance/tree/master/Dockerfile)

docker run -d --net=host \
           -e START_MASTER="true" \
           -e START_WORKER="true" \
           -e START_WEBAPP="true" \
           -e START_NOTEBOOK="true" \
           registry.opensource.zalan.do/bi/spark:1.6.2-6


-> SDK and File System Dependencies

as mentioned above, using EMRFS libs solved this problem:
http://docs.aws.amazon.com/ElasticMapReduce/latest/ReleaseGuide/emr-fs.html


2016-07-21 8:37 GMT+02:00 Gourav Sengupta <gourav.sengupta@gmail.com>:
> But that would mean accessing the data over the internet, which increases
> read latency and the risk of transmission failures. Why are you not using EMR?
>
> Regards,
> Gourav
>
> On Thu, Jul 21, 2016 at 1:06 AM, Everett Anderson <everett@nuna.com.invalid>
> wrote:
>>
>> Thanks, Andy.
>>
>> I am indeed often doing something similar, now -- copying data locally
>> rather than dealing with the S3 impl selection and AWS credentials issues.
>> It'd be nice if it worked a little easier out of the box, though!
>>
>>
>> On Tue, Jul 19, 2016 at 2:47 PM, Andy Davidson
>> <Andy@santacruzintegration.com> wrote:
>>>
>>> Hi Everett
>>>
>>> I always do my initial data exploration and all our product development
>>> in my local dev env. I typically select a small data set and copy it to my
>>> local machine
>>>
>>> My main() has an optional command line argument '--runLocal'. Normally I
>>> load data from either hdfs:/// or s3n://. If the arg is set, I read from
>>> file:///
>>>
>>> Sometimes I use a CLI arg '--dataFileURL'.
>>>
>>> So in your case I would log into my data cluster and use "aws s3 cp" to
>>> copy the data into my cluster, and then use "scp" to copy the data from the
>>> data center back to my local env.
>>>
>>> Andy
>>>
>>> From: Everett Anderson <everett@nuna.com.INVALID>
>>> Date: Tuesday, July 19, 2016 at 2:30 PM
>>> To: "user @spark" <user@spark.apache.org>
>>> Subject: Role-based S3 access outside of EMR
>>>
>>> Hi,
>>>
>>> When running on EMR, AWS configures Hadoop to use their EMRFS Hadoop
>>> FileSystem implementation for s3:// URLs and seems to install the necessary
>>> S3 credentials properties, as well.
>>>
>>> Often, it's nice during development to run outside of a cluster even with
>>> the "local" Spark master, though, which I've found to be more troublesome.
>>> I'm curious if I'm doing this the right way.
>>>
>>> There are two issues -- AWS credentials and finding the right combination
>>> of compatible AWS SDK and Hadoop S3 FileSystem dependencies.
>>>
>>> Credentials and Hadoop Configuration
>>>
>>> For credentials, some guides recommend setting AWS_SECRET_ACCESS_KEY and
>>> AWS_ACCESS_KEY_ID environment variables or putting the corresponding
>>> properties in Hadoop XML config files, but it seems better practice to rely
>>> on machine roles and not expose these.
>>>
>>> What I end up doing is, in code, when not running on EMR, creating a
>>> DefaultAWSCredentialsProviderChain and then installing the following
>>> properties in the Hadoop Configuration using it:
>>>
>>> fs.s3.awsAccessKeyId
>>> fs.s3n.awsAccessKeyId
>>> fs.s3a.awsAccessKeyId
>>> fs.s3.awsSecretAccessKey
>>> fs.s3n.awsSecretAccessKey
>>> fs.s3a.awsSecretAccessKey
>>>
>>> I also set the fs.s3.impl and fs.s3n.impl properties to
>>> org.apache.hadoop.fs.s3a.S3AFileSystem to force them to use the S3A
>>> implementation since people usually use "s3://" URIs.
>>>
>>> SDK and File System Dependencies
>>>
>>> Some special combination of the Hadoop version, AWS SDK version, and
>>> hadoop-aws is necessary.
>>>
>>> One working S3A combination with Spark 1.6.1 + Hadoop 2.7.x for me seems
>>> to be with
>>>
>>> --packages
>>> com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.2
>>>
>>> Is this generally what people do? Is there a better way?
>>>
>>> I realize this isn't entirely a Spark-specific problem, but as so many
>>> people seem to be using S3 with Spark, I imagine this community's faced the
>>> problem a lot.
>>>
>>> Thanks!
>>>
>>> - Everett
>>>
>>
>
