spark-user mailing list archives

From Everett Anderson <ever...@nuna.com.INVALID>
Subject Re: Role-based S3 access outside of EMR
Date Thu, 21 Jul 2016 16:00:32 GMT
Hey,

FWIW, we are actually using EMR in production.

The main case I have for wanting to access S3 with Spark outside of EMR is
that during development, our developers tend to run EC2 sandbox instances
that have all the rest of our code and access to some of the input data on
S3. It'd be nice if S3 access "just worked" on these without storing the
access keys in an exposed manner.

Teng -- when you say you use EMRFS, does that mean you copied AWS's EMRFS
JAR from an EMR cluster and are using it outside? My impression is that AWS
hasn't released the EMRFS implementation as part of the aws-java-sdk, so
I'm wary of using it. Do you know if it's supported?


On Thu, Jul 21, 2016 at 2:32 AM, Gourav Sengupta <gourav.sengupta@gmail.com>
wrote:

> Hi Teng,
>
> This is complete news to me that people cannot use EMR in production
> because it's not open sourced; I think that even Werner is not aware of
> such a problem. Is EMRFS open sourced? I am also curious: what does HA
> stand for?
>
> Regards,
> Gourav
>
> On Thu, Jul 21, 2016 at 8:37 AM, Teng Qiu <tengqiu@gmail.com> wrote:
>
>> there are several reasons that AWS users do (or can) not use EMR. one
>> point for us is a security compliance problem: EMR is not open
>> sourced, so we cannot use it in a production system. the second is
>> that EMR does not support HA yet.
>>
>> but back to the original question from @Everett:
>>
>> -> Credentials and Hadoop Configuration
>>
>> as you said, best practice should be to "rely on machine roles"; these
>> are called IAM roles.
>>
>> we are using the EMRFS impl for accessing s3; it supports IAM role-based
>> access control well. you can take a look here:
>> https://github.com/zalando/spark/tree/branch-1.6-zalando
>>
>> or simply use our docker image (Dockerfile on github:
>> https://github.com/zalando/spark-appliance/tree/master/Dockerfile)
>>
>> docker run -d --net=host \
>>            -e START_MASTER="true" \
>>            -e START_WORKER="true" \
>>            -e START_WEBAPP="true" \
>>            -e START_NOTEBOOK="true" \
>>            registry.opensource.zalan.do/bi/spark:1.6.2-6
>>
>>
>> -> SDK and File System Dependencies
>>
>> as mentioned above, using EMRFS libs solved this problem:
>>
>> http://docs.aws.amazon.com/ElasticMapReduce/latest/ReleaseGuide/emr-fs.html
>>
>>
>> 2016-07-21 8:37 GMT+02:00 Gourav Sengupta <gourav.sengupta@gmail.com>:
>> > But that would mean you would be accessing data over the internet,
>> > increasing data read latency and data transmission failures. Why are
>> > you not using EMR?
>> >
>> > Regards,
>> > Gourav
>> >
>> > On Thu, Jul 21, 2016 at 1:06 AM, Everett Anderson <everett@nuna.com.invalid>
>> > wrote:
>> >>
>> >> Thanks, Andy.
>> >>
>> >> I am indeed often doing something similar now -- copying data locally
>> >> rather than dealing with the S3 impl selection and AWS credentials
>> >> issues. It'd be nice if it worked a little more easily out of the box,
>> >> though!
>> >>
>> >>
>> >> On Tue, Jul 19, 2016 at 2:47 PM, Andy Davidson
>> >> <Andy@santacruzintegration.com> wrote:
>> >>>
>> >>> Hi Everett
>> >>>
>> >>> I always do my initial data exploration and all our product
>> >>> development in my local dev env. I typically select a small data set
>> >>> and copy it to my local machine.
>> >>>
>> >>> My main() has an optional command line argument '--runLocal'.
>> >>> Normally I load data from either hdfs:/// or s3n://. If the arg is
>> >>> set, I read from file:///.
>> >>>
>> >>> Sometimes I use a CLI arg '--dataFileURL'.
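>> >>>
>> >>> For example, roughly (a sketch; the flag, names, and paths here are
>> >>> illustrative, not my actual code):
>> >>>
>> >>> // Sketch: pick the data URL based on an optional --runLocal flag.
>> >>> import org.apache.spark.{SparkConf, SparkContext}
>> >>>
>> >>> object MyJob {
>> >>>   def main(args: Array[String]): Unit = {
>> >>>     val runLocal = args.contains("--runLocal")
>> >>>     val conf = new SparkConf().setAppName("my-job")
>> >>>     if (runLocal) conf.setMaster("local[*]")
>> >>>     val sc = new SparkContext(conf)
>> >>>     val dataFileURL =
>> >>>       if (runLocal) "file:///tmp/sample-data" // small local copy
>> >>>       else "s3n://some-bucket/input"          // placeholder bucket
>> >>>     println(s"rows = ${sc.textFile(dataFileURL).count()}")
>> >>>     sc.stop()
>> >>>   }
>> >>> }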
>> >>>
>> >>> So in your case I would log into my data cluster, use "aws s3 cp" to
>> >>> copy the data onto the cluster, and then use "scp" to copy the data
>> >>> from the data center back to my local env.
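>> >>>
>> >>> Something like this (bucket and host names are placeholders):
>> >>>
>> >>> # on the cluster: pull a small sample down from S3
>> >>> aws s3 cp s3://some-bucket/input/part-00000 ~/sample-data
>> >>> # from my local machine: copy it back over SSH
>> >>> scp myuser@cluster-host:~/sample-data /tmp/sample-data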
>> >>>
>> >>> Andy
>> >>>
>> >>> From: Everett Anderson <everett@nuna.com.INVALID>
>> >>> Date: Tuesday, July 19, 2016 at 2:30 PM
>> >>> To: "user @spark" <user@spark.apache.org>
>> >>> Subject: Role-based S3 access outside of EMR
>> >>>
>> >>> Hi,
>> >>>
>> >>> When running on EMR, AWS configures Hadoop to use their EMRFS Hadoop
>> >>> FileSystem implementation for s3:// URLs and seems to install the
>> >>> necessary S3 credentials properties as well.
>> >>>
>> >>> Often, though, it's nice during development to run outside of a
>> >>> cluster, even with the "local" Spark master, which I've found to be
>> >>> more troublesome. I'm curious if I'm doing this the right way.
>> >>>
>> >>> There are two issues -- AWS credentials and finding the right
>> >>> combination of compatible AWS SDK and Hadoop S3 FileSystem
>> >>> dependencies.
>> >>>
>> >>> Credentials and Hadoop Configuration
>> >>>
>> >>> For credentials, some guides recommend setting the
>> >>> AWS_SECRET_ACCESS_KEY and AWS_ACCESS_KEY_ID environment variables or
>> >>> putting the corresponding properties in Hadoop XML config files, but
>> >>> it seems better practice to rely on machine roles and not expose
>> >>> these.
>> >>>
>> >>> What I end up doing in code, when not running on EMR, is creating a
>> >>> DefaultAWSCredentialsProviderChain and then using it to install the
>> >>> following properties in the Hadoop Configuration:
>> >>>
>> >>> fs.s3.awsAccessKeyId
>> >>> fs.s3n.awsAccessKeyId
>> >>> fs.s3a.awsAccessKeyId
>> >>> fs.s3.awsSecretAccessKey
>> >>> fs.s3n.awsSecretAccessKey
>> >>> fs.s3a.awsSecretAccessKey
>> >>>
>> >>> I also set the fs.s3.impl and fs.s3n.impl properties to
>> >>> org.apache.hadoop.fs.s3a.S3AFileSystem to force them to use the S3A
>> >>> implementation since people usually use "s3://" URIs.
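>> >>>
>> >>> In code it's roughly like this (a sketch of my approach, not the
>> >>> exact code; the helper name is made up):
>> >>>
>> >>> // Sketch: copy keys from the default AWS credentials chain into the
>> >>> // Hadoop Configuration and force the S3A implementation everywhere.
>> >>> import com.amazonaws.auth.DefaultAWSCredentialsProviderChain
>> >>> import org.apache.hadoop.conf.Configuration
>> >>>
>> >>> def installS3Credentials(hadoopConf: Configuration): Unit = {
>> >>>   val creds = new DefaultAWSCredentialsProviderChain().getCredentials
>> >>>   for (scheme <- Seq("s3", "s3n", "s3a")) {
>> >>>     hadoopConf.set(s"fs.$scheme.awsAccessKeyId", creds.getAWSAccessKeyId)
>> >>>     hadoopConf.set(s"fs.$scheme.awsSecretAccessKey", creds.getAWSSecretKey)
>> >>>   }
>> >>>   hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
>> >>>   hadoopConf.set("fs.s3n.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
>> >>> }
>> >>>
>> >>> I call this on sc.hadoopConfiguration before reading anything from S3.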
>> >>>
>> >>> SDK and File System Dependencies
>> >>>
>> >>> Some special combination of the Hadoop version, AWS SDK version, and
>> >>> hadoop-aws is necessary.
>> >>>
>> >>> One working S3A combination with Spark 1.6.1 + Hadoop 2.7.x for me
>> >>> seems to be:
>> >>>
>> >>> --packages
>> >>> com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.2
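>> >>>
>> >>> e.g., as a full invocation (a sketch; the class and jar names are
>> >>> placeholders):
>> >>>
>> >>> spark-submit --master "local[*]" \
>> >>>   --packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.2 \
>> >>>   --class com.example.MyJob \
>> >>>   my-job.jar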
>> >>>
>> >>> Is this generally what people do? Is there a better way?
>> >>>
>> >>> I realize this isn't entirely a Spark-specific problem, but as so
>> >>> many people seem to be using S3 with Spark, I imagine this community
>> >>> has faced the problem a lot.
>> >>>
>> >>> Thanks!
>> >>>
>> >>> - Everett
>> >>>
>> >>
>> >
>>
>
>
