spark-user mailing list archives

From Gourav Sengupta <gourav.sengu...@gmail.com>
Subject Re: Role-based S3 access outside of EMR
Date Thu, 21 Jul 2016 06:37:54 GMT
But that would mean you would be accessing data over the internet, increasing
data read latency and the risk of data transmission failures. Why are you not
using EMR?

Regards,
Gourav

On Thu, Jul 21, 2016 at 1:06 AM, Everett Anderson <everett@nuna.com.invalid>
wrote:

> Thanks, Andy.
>
> I am indeed often doing something similar, now -- copying data locally
> rather than dealing with the S3 impl selection and AWS credentials issues.
> It'd be nice if it worked a little easier out of the box, though!
>
>
> On Tue, Jul 19, 2016 at 2:47 PM, Andy Davidson <
> Andy@santacruzintegration.com> wrote:
>
>> Hi Everett
>>
>> I always do my initial data exploration and all our product development
>> in my local dev env. I typically select a small data set and copy it to my
>> local machine.
>>
>> My main() has an optional command line argument '--runLocal'. Normally I
>> load data from either hdfs:/// or s3n://. If the arg is set, I read from
>> file:///.
>>
>> Sometimes I use a CLI arg '--dataFileURL'.
>>
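A minimal sketch of the flag handling described above, written in Python for illustration. The flag names --runLocal and --dataFileURL come from the message; the --localPath flag, the bucket name, the default paths, and the rule that --dataFileURL wins over --runLocal are all assumptions made for this example.

```python
import argparse


def resolve_input_url(argv, cluster_url="s3n://example-bucket/data"):
    """Pick the input location from command-line flags.

    cluster_url and the bucket name are placeholders, not real paths.
    """
    parser = argparse.ArgumentParser()
    parser.add_argument("--runLocal", action="store_true",
                        help="read from the local file system")
    parser.add_argument("--dataFileURL", default=None,
                        help="explicit input URL (assumed to override --runLocal)")
    parser.add_argument("--localPath", default="/tmp/sample-data",
                        help="hypothetical location of the local sample copy")
    args = parser.parse_args(argv)

    if args.dataFileURL:
        return args.dataFileURL
    if args.runLocal:
        return "file://" + args.localPath
    return cluster_url


# Example call with an explicit argument list instead of sys.argv.
print(resolve_input_url(["--runLocal"]))
```

The returned URL can then be passed to whatever load call the job already uses, so the rest of the code does not need to know where the data lives.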
>> So in your case I would log into my data cluster, use "aws s3 cp" to
>> copy the data into my cluster, and then use "scp" to copy the data from the
>> data center back to my local env.
>>
>> Andy
>>
>> From: Everett Anderson <everett@nuna.com.INVALID>
>> Date: Tuesday, July 19, 2016 at 2:30 PM
>> To: "user @spark" <user@spark.apache.org>
>> Subject: Role-based S3 access outside of EMR
>>
>> Hi,
>>
>> When running on EMR, AWS configures Hadoop to use their EMRFS Hadoop
>> FileSystem implementation for s3:// URLs and seems to install the
>> necessary S3 credentials properties, as well.
>>
>> Often, it's nice during development to run outside of a cluster even with
>> the "local" Spark master, though, which I've found to be more troublesome.
>> I'm curious if I'm doing this the right way.
>>
>> There are two issues -- AWS credentials and finding the right combination
>> of compatible AWS SDK and Hadoop S3 FileSystem dependencies.
>>
>> *Credentials and Hadoop Configuration*
>>
>> For credentials, some guides recommend setting AWS_SECRET_ACCESS_KEY and
>> AWS_ACCESS_KEY_ID environment variables or putting the corresponding
>> properties in Hadoop XML config files, but it seems better practice to rely
>> on machine roles and not expose these.
>>
>> What I end up doing is, in code, when not running on EMR, creating a
>> DefaultAWSCredentialsProviderChain
>> <https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/auth/DefaultAWSCredentialsProviderChain.html>
>> and then installing the following properties in the Hadoop Configuration
>> using it:
>>
>> fs.s3.awsAccessKeyId
>> fs.s3n.awsAccessKeyId
>> fs.s3a.awsAccessKeyId
>> fs.s3.awsSecretAccessKey
>> fs.s3n.awsSecretAccessKey
>> fs.s3a.awsSecretAccessKey
>>
>> I also set the fs.s3.impl and fs.s3n.impl properties to
>> org.apache.hadoop.fs.s3a.S3AFileSystem to force them to use the S3A
>> implementation since people usually use "s3://" URIs.
>>
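The property setup described above can be sketched roughly as follows. This is an illustration in Python, not the poster's code: it only builds the property names listed in the message and copies them into any object with a set(key, value) method. FakeConf is a hypothetical stand-in so the sketch runs without Spark; in PySpark the real object is commonly reached through sc._jsc.hadoopConfiguration(), which is an assumption about your setup. The key values are placeholders. Note also that the hadoop-aws documentation lists fs.s3a.access.key and fs.s3a.secret.key as the S3A credential properties, so it may be worth checking which names your Hadoop version actually reads.

```python
S3_SCHEMES = ("s3", "s3n", "s3a")
S3A_IMPL = "org.apache.hadoop.fs.s3a.S3AFileSystem"


def s3_hadoop_properties(access_key, secret_key):
    """Return the Hadoop properties named in the message as a dict."""
    props = {}
    for scheme in S3_SCHEMES:
        props["fs.%s.awsAccessKeyId" % scheme] = access_key
        props["fs.%s.awsSecretAccessKey" % scheme] = secret_key
    # Route s3:// and s3n:// URIs to the S3A implementation.
    props["fs.s3.impl"] = S3A_IMPL
    props["fs.s3n.impl"] = S3A_IMPL
    return props


def apply_properties(hadoop_conf, props):
    """Copy props into any object exposing set(key, value)."""
    for key, value in sorted(props.items()):
        hadoop_conf.set(key, value)


class FakeConf:
    """Hypothetical stand-in for a Hadoop Configuration, for this demo only."""

    def __init__(self):
        self.values = {}

    def set(self, key, value):
        self.values[key] = value


conf = FakeConf()
apply_properties(conf, s3_hadoop_properties("EXAMPLE_KEY_ID", "EXAMPLE_SECRET"))
# Print only the property names, never the secret values.
print(sorted(conf.values))
```

Keeping the property map separate from the code that applies it makes it easy to skip the whole step when running on EMR, where the message says the credentials are already configured.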
>> *SDK and File System Dependencies*
>>
>> Some special combination
>> <https://issues.apache.org/jira/browse/HADOOP-12420> of the Hadoop
>> version, AWS SDK version, and hadoop-aws is necessary.
>>
>> One working S3A combination with Spark 1.6.1 + Hadoop 2.7.x for me seems
>> to be with
>>
>> --packages
>> com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.2
>>
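For reference, the --packages value above is a single comma-separated list of Maven coordinates. The sketch below, in Python, assembles a spark-submit command line with the two coordinates from the message. The application file my_job.py and the local[*] master are placeholders chosen for the example, and the versions are only the combination the poster reports as working for Spark 1.6.1 with Hadoop 2.7.x.

```python
import shlex

# Coordinates copied from the message; adjust to match your Hadoop version.
PACKAGES = [
    "com.amazonaws:aws-java-sdk:1.7.4",
    "org.apache.hadoop:hadoop-aws:2.7.2",
]


def spark_submit_args(app, packages=PACKAGES, master="local[*]"):
    """Build the argument list for a spark-submit invocation."""
    return [
        "spark-submit",
        "--master", master,
        "--packages", ",".join(packages),
        app,
    ]


cmd = spark_submit_args("my_job.py")  # my_job.py is a hypothetical script
# Quote each part so the line can be pasted into a shell.
print(" ".join(shlex.quote(part) for part in cmd))
```

Building the list in one place keeps the SDK and hadoop-aws versions pinned together, which is the part the message reports as fragile.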
>> Is this generally what people do? Is there a better way?
>>
>> I realize this isn't entirely a Spark-specific problem, but as so many
>> people seem to be using S3 with Spark, I imagine this community's faced the
>> problem a lot.
>>
>> Thanks!
>>
>> - Everett
>>
>>
>
