spark-user mailing list archives

From Gourav Sengupta <gourav.sengu...@gmail.com>
Subject Re: Role-based S3 access outside of EMR
Date Thu, 21 Jul 2016 09:32:52 GMT
Hi Teng,

This is totally news to me, that people cannot use EMR in production
because it's not open sourced. I think that even Werner is not aware of
such a problem. Is EMRFS open sourced? I am also curious to know what HA
stands for.

Regards,
Gourav

On Thu, Jul 21, 2016 at 8:37 AM, Teng Qiu <tengqiu@gmail.com> wrote:

> There are several reasons that AWS users do not (or cannot) use EMR. One
> point for us is a security compliance problem: EMR is not open sourced at
> all, so we cannot use it in our production systems. The second is that
> EMR does not support HA yet.
>
> But to the original question from @Everett:
>
> -> Credentials and Hadoop Configuration
>
> As you said, the best practice should be to "rely on machine roles"; they
> are called IAM roles.
>
> We are using the EMRFS impl for accessing S3; it supports IAM role-based
> access control well. You can take a look here:
> https://github.com/zalando/spark/tree/branch-1.6-zalando
>
> Or simply use our Docker image (Dockerfile on GitHub:
> https://github.com/zalando/spark-appliance/tree/master/Dockerfile)
>
> docker run -d --net=host \
>            -e START_MASTER="true" \
>            -e START_WORKER="true" \
>            -e START_WEBAPP="true" \
>            -e START_NOTEBOOK="true" \
>            registry.opensource.zalan.do/bi/spark:1.6.2-6
>
>
> -> SDK and File System Dependencies
>
> As mentioned above, using the EMRFS libs solved this problem:
>
> http://docs.aws.amazon.com//ElasticMapReduce/latest/ReleaseGuide/emr-fs.html
>
>
> 2016-07-21 8:37 GMT+02:00 Gourav Sengupta <gourav.sengupta@gmail.com>:
> > But that would mean accessing data over the internet, increasing data
> > read latency and the risk of data transmission failures. Why are you
> > not using EMR?
> >
> > Regards,
> > Gourav
> >
> > On Thu, Jul 21, 2016 at 1:06 AM, Everett Anderson <everett@nuna.com.invalid>
> > wrote:
> >>
> >> Thanks, Andy.
> >>
> >> I am indeed often doing something similar now -- copying data locally
> >> rather than dealing with the S3 impl selection and AWS credentials issues.
> >> It'd be nice if it worked a little easier out of the box, though!
> >>
> >>
> >> On Tue, Jul 19, 2016 at 2:47 PM, Andy Davidson
> >> <Andy@santacruzintegration.com> wrote:
> >>>
> >>> Hi Everett
> >>>
> >>> I always do my initial data exploration and all our product development
> >>> in my local dev env. I typically select a small data set and copy it to
> >>> my local machine.
> >>>
> >>> My main() has an optional command line argument ‘--runLocal’. Normally I
> >>> load data from either hdfs:/// or s3n://. If the arg is set, I read from
> >>> file:///.
> >>>
> >>> Sometimes I use a CLI arg ‘--dataFileURL’.
> >>>
> >>> So in your case I would log into my data cluster and use "aws s3 cp" to
> >>> copy the data into my cluster and then use "scp" to copy the data from
> >>> the data center back to my local env.
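
The flag handling described above could be sketched roughly as follows. This is only an illustration under assumptions: the flag names come from the message, but the default paths and the rule that an explicit URL wins over --runLocal are hypothetical choices, not something the message states.

```python
import argparse


def build_parser():
    """Build a parser with the two flags named in the message."""
    parser = argparse.ArgumentParser(description="Pick the input location for a job")
    parser.add_argument("--runLocal", action="store_true",
                        help="read input from the local file system")
    parser.add_argument("--dataFileURL", default=None,
                        help="explicit input URL")
    return parser


def resolve_input_url(args,
                      local_path="/tmp/sample-data",            # hypothetical path
                      cluster_url="hdfs:///data/sample-data"):  # hypothetical path
    """Return the URL the job should read from."""
    # Assumption: an explicit --dataFileURL takes precedence over --runLocal.
    if args.dataFileURL:
        return args.dataFileURL
    if args.runLocal:
        return "file://" + local_path
    return cluster_url


if __name__ == "__main__":
    print(resolve_input_url(build_parser().parse_args()))
```

The same job can then run against file:/// on a laptop and hdfs:/// or s3n:// on the cluster without code changes.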
> >>>
> >>> Andy
> >>>
> >>> From: Everett Anderson <everett@nuna.com.INVALID>
> >>> Date: Tuesday, July 19, 2016 at 2:30 PM
> >>> To: "user @spark" <user@spark.apache.org>
> >>> Subject: Role-based S3 access outside of EMR
> >>>
> >>> Hi,
> >>>
> >>> When running on EMR, AWS configures Hadoop to use their EMRFS Hadoop
> >>> FileSystem implementation for s3:// URLs and seems to install the
> >>> necessary S3 credentials properties, as well.
> >>>
> >>> Often, it's nice during development to run outside of a cluster even with
> >>> the "local" Spark master, though, which I've found to be more troublesome.
> >>> I'm curious if I'm doing this the right way.
> >>>
> >>> There are two issues -- AWS credentials and finding the right combination
> >>> of compatible AWS SDK and Hadoop S3 FileSystem dependencies.
> >>>
> >>> Credentials and Hadoop Configuration
> >>>
> >>> For credentials, some guides recommend setting AWS_SECRET_ACCESS_KEY and
> >>> AWS_ACCESS_KEY_ID environment variables or putting the corresponding
> >>> properties in Hadoop XML config files, but it seems better practice to
> >>> rely on machine roles and not expose these.
> >>>
> >>> What I end up doing is, in code, when not running on EMR, creating a
> >>> DefaultAWSCredentialsProviderChain and then installing the following
> >>> properties in the Hadoop Configuration using it:
> >>>
> >>> fs.s3.awsAccessKeyId
> >>> fs.s3n.awsAccessKeyId
> >>> fs.s3a.awsAccessKeyId
> >>> fs.s3.awsSecretAccessKey
> >>> fs.s3n.awsSecretAccessKey
> >>> fs.s3a.awsSecretAccessKey
> >>>
> >>> I also set the fs.s3.impl and fs.s3n.impl properties to
> >>> org.apache.hadoop.fs.s3a.S3AFileSystem to force them to use the S3A
> >>> implementation since people usually use "s3://" URIs.
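
As a rough sketch of the property set described above, the helper below builds the keys as a plain dictionary. It is an illustration, not the code from the message: the function name is hypothetical, the credentials are passed in as arguments (how they are resolved is left out), and the two `fs.s3a.access.key` / `fs.s3a.secret.key` entries are an addition not in the original list, included on the assumption that these are the names the Hadoop 2.7 S3A documentation uses.

```python
S3A_IMPL = "org.apache.hadoop.fs.s3a.S3AFileSystem"


def s3_hadoop_properties(access_key, secret_key):
    """Return the Hadoop properties from the message as a dict (hypothetical helper)."""
    props = {}
    # The six credential properties listed in the message.
    for scheme in ("s3", "s3n", "s3a"):
        props["fs.%s.awsAccessKeyId" % scheme] = access_key
        props["fs.%s.awsSecretAccessKey" % scheme] = secret_key
    # Assumption, not in the original list: the S3A names documented for Hadoop 2.7.
    props["fs.s3a.access.key"] = access_key
    props["fs.s3a.secret.key"] = secret_key
    # Route s3:// and s3n:// URIs to the S3A implementation.
    props["fs.s3.impl"] = S3A_IMPL
    props["fs.s3n.impl"] = S3A_IMPL
    return props
```

In PySpark, one possible way to apply such a dictionary is to loop over it and call `sc._jsc.hadoopConfiguration().set(key, value)`; note that `_jsc` is a private attribute, so treat that as a convenience rather than a supported interface.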
> >>>
> >>> SDK and File System Dependencies
> >>>
> >>> Some special combination of the Hadoop version, AWS SDK version, and
> >>> hadoop-aws is necessary.
> >>>
> >>> One working S3A combination with Spark 1.6.1 + Hadoop 2.7.x for me seems
> >>> to be with
> >>>
> >>> --packages
> >>> com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.2
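
To keep the dependency pair from the message in one place, a small wrapper could assemble the spark-submit command line. This is a sketch under assumptions: the function name, the `my_job.py` script, and the `local[*]` master are hypothetical; only the two package coordinates come from the message.

```python
# Coordinates quoted in the message for Spark 1.6.1 + Hadoop 2.7.x.
S3A_PACKAGES = [
    "com.amazonaws:aws-java-sdk:1.7.4",
    "org.apache.hadoop:hadoop-aws:2.7.2",
]


def spark_submit_command(app, packages=S3A_PACKAGES, master="local[*]"):
    """Return the argument list for a spark-submit call (hypothetical helper)."""
    return [
        "spark-submit",
        "--master", master,
        "--packages", ",".join(packages),
        app,
    ]


if __name__ == "__main__":
    # Print the command instead of running it; my_job.py is a placeholder.
    print(" ".join(spark_submit_command("my_job.py")))
```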
> >>>
> >>> Is this generally what people do? Is there a better way?
> >>>
> >>> I realize this isn't entirely a Spark-specific problem, but as so many
> >>> people seem to be using S3 with Spark, I imagine this community's faced
> >>> the problem a lot.
> >>>
> >>> Thanks!
> >>>
> >>> - Everett
> >>>
> >>
> >
>
