spark-user mailing list archives

From Hariharan <hariharan...@gmail.com>
Subject Re: Can't get Spark to interface with S3A Filesystem with correct credentials
Date Thu, 05 Mar 2020 04:01:54 GMT
If you're using Hadoop 2.7 or below, you may also need to set the
following Hadoop properties:

fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
fs.s3.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
fs.AbstractFileSystem.s3.impl=org.apache.hadoop.fs.s3a.S3A
fs.AbstractFileSystem.s3a.impl=org.apache.hadoop.fs.s3a.S3A

Hadoop 2.8 and above set these by default.
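As a sketch, these can be passed to spark-shell as --conf flags with the spark.hadoop. prefix, alongside the credentials (YOUR_ACCESS_KEY and YOUR_SECRET_KEY below are placeholders, and 2.7.3 just matches the Hadoop build mentioned in this thread):

```
spark-shell \
  --packages org.apache.hadoop:hadoop-aws:2.7.3 \
  --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
  --conf spark.hadoop.fs.s3.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
  --conf spark.hadoop.fs.AbstractFileSystem.s3.impl=org.apache.hadoop.fs.s3a.S3A \
  --conf spark.hadoop.fs.AbstractFileSystem.s3a.impl=org.apache.hadoop.fs.s3a.S3A \
  --conf spark.hadoop.fs.s3a.access.key=YOUR_ACCESS_KEY \
  --conf spark.hadoop.fs.s3a.secret.key=YOUR_SECRET_KEY
```

Note the spark.hadoop. prefix: Spark strips it and copies the remainder into the Hadoop Configuration, which is how these fs.* properties reach the S3A connector.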

Thanks,
Hariharan

On Thu, Mar 5, 2020 at 2:41 AM Devin Boyer
<devin.boyer@mapbox.com.invalid> wrote:
>
> Hello,
>
> I'm attempting to run Spark within a Docker container with the hope of eventually running
Spark on Kubernetes. Nearly all the data we currently process with Spark is stored in S3,
so I need to be able to interface with it using the S3A filesystem.
>
> I feel like I've gotten close to getting this working but for some reason cannot get
my local Spark installations to correctly interface with S3 yet.
>
> A basic example of what I've tried:
>
> Build Kubernetes docker images by downloading the spark-2.4.5-bin-hadoop2.7.tgz archive
and building the kubernetes/dockerfiles/spark/Dockerfile image.
> Run an interactive docker container using the above built image.
> Within that container, run spark-shell. This command passes valid AWS credentials by
setting spark.hadoop.fs.s3a.access.key and spark.hadoop.fs.s3a.secret.key using --conf flags,
and downloads the hadoop-aws package by specifying the --packages org.apache.hadoop:hadoop-aws:2.7.3
flag.
> Try to access the simple public file as outlined in the "Integration with Cloud Infrastructures"
documentation by running: sc.textFile("s3a://landsat-pds/scene_list.gz").take(5)
> Observe this to fail with a 403 Forbidden exception thrown by S3
>
>
> I've tried a variety of other means of setting credentials (like exporting the standard
AWS_ACCESS_KEY_ID environment variable before launching spark-shell), and other means of building
a Spark image and including the appropriate libraries (see this Github repo: https://github.com/drboyer/spark-s3a-demo),
all with the same results. I've tried also accessing objects within our AWS account, rather
than the object from the public landsat-pds bucket, with the same 403 error being thrown.
>
> Can anyone help explain why I can't seem to connect to S3 successfully using Spark, or
even explain where I could look for additional clues as to what's misconfigured? I've tried
turning up the logging verbosity and didn't see much that was particularly useful, but happy
to share additional log output too.
>
> Thanks for any help you can provide!
>
> Best,
> Devin Boyer

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org

