spark-user mailing list archives

From "Pagliari, Roberto"
Subject RE: Spark on EMR with S3 example (Python)
Date Tue, 14 Jul 2015 20:56:46 GMT
Hi Sujit,
I just wanted to access public datasets on Amazon. Do I still need to provide the keys?

Thank you,

From: Sujit Pal
Sent: Tuesday, July 14, 2015 3:14 PM
To: Pagliari, Roberto
Subject: Re: Spark on EMR with S3 example (Python)

Hi Roberto,

I have written PySpark code that reads from private S3 buckets; it should be similar for public
S3 buckets as well. You need to set the AWS access and secret keys on the SparkContext's Hadoop
configuration, then you can access the S3 folders and files with their s3n:// paths. Something
like this:

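from pyspark import SparkContext

sc = SparkContext()
# Set the AWS credentials on the underlying Hadoop configuration.
sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", aws_access_key)
sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", aws_secret_key)

# Read every file under the folder as lines of text, then transform each line.
mydata = sc.textFile("s3n://mybucket/my_input_folder") \
           .map(lambda x: do_something(x))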

You can read and write sequence files as well. These are the only two formats I have tried,
but I'm sure other ones such as JSON would work too. Another approach is to embed the AWS
access key and secret key into the s3n:// path.
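
As a rough sketch of the second approach, the snippet below builds an s3n:// path with the
keys embedded, in the form s3n://ACCESS_KEY:SECRET_KEY@bucket/path. The helper name
s3n_uri_with_credentials and the sample key values are made up for illustration and are not
part of Spark or Hadoop. Percent-encoding the keys is an assumption here: secret keys that
contain "/" are known to be awkward in this form, so the encoded path may still need testing
against your Hadoop version.

```python
from urllib.parse import quote


def s3n_uri_with_credentials(access_key, secret_key, bucket, path=""):
    """Build an s3n:// URI of the form s3n://ACCESS:SECRET@bucket/path.

    Hypothetical helper for illustration only.
    """
    # Percent-encode both keys so characters such as "/" or "+" are not
    # mistaken for URI delimiters.
    user = quote(access_key, safe="")
    password = quote(secret_key, safe="")
    return "s3n://{}:{}@{}/{}".format(user, password, bucket, path.lstrip("/"))


# Placeholder credentials, not real keys.
uri = s3n_uri_with_credentials("AKIAEXAMPLE", "abc/def+ghi", "mybucket", "my_input_folder")
print(uri)
```

The resulting string could then be passed to sc.textFile(uri) in place of a plain s3n:// path.
Since the keys become part of the path, they can show up in logs and error messages, which is
one reason to prefer setting them on the Hadoop configuration as above.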

I wasn't able to use the s3 protocol, but s3n is equivalent (I believe it's an older version,
but I'm not sure) and it works for access.

Hope this helps,

On Tue, Jul 14, 2015 at 10:50 AM, Pagliari, Roberto wrote:
Is there an example of how to load data from a public S3 bucket in Python? I haven’t
found any.

Thank you,
