spark-user mailing list archives

From Devin Boyer <devin.bo...@mapbox.com.INVALID>
Subject Re: Can't get Spark to interface with S3A Filesystem with correct credentials
Date Thu, 05 Mar 2020 21:38:49 GMT
Thanks for the input, Steven and Hariharan. I think this ended up being a
combination of bad configuration with the credential providers I was using
*and* using the wrong set of credentials for the test data I was trying to
access.

I was able to get this working with both Hadoop 2.8 and 3.1 by pulling down
the correct hadoop-aws and aws-java-sdk[-bundle] artifacts and fixing the
credential provider I was using for testing. It's probably the same for the
Spark distribution compiled for Hadoop 2.7, but since I already have a
build with a more modern Hadoop version working, I may just stick with that.

Best,
Devin
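
A minimal sketch of the invocation described in this thread, assuming the
hadoop-aws version should equal the Hadoop version the Spark distribution was
built against. The helper name spark_shell_command and the placeholder
credentials are hypothetical; the --packages coordinate and the
spark.hadoop.fs.s3a.* property names are the ones quoted in the original
message below.

```python
import shlex


def spark_shell_command(hadoop_version, access_key, secret_key):
    """Return a spark-shell argument list that pulls in hadoop-aws for S3A.

    Assumption: the hadoop-aws version should equal the Hadoop version the
    Spark distribution was built against (2.7.3 in the original attempt).
    """
    return [
        "spark-shell",
        # hadoop-aws brings in its aws-java-sdk dependency transitively.
        "--packages", f"org.apache.hadoop:hadoop-aws:{hadoop_version}",
        # Properties prefixed with spark.hadoop. are copied into the
        # Hadoop configuration that the S3A filesystem reads.
        "--conf", f"spark.hadoop.fs.s3a.access.key={access_key}",
        "--conf", f"spark.hadoop.fs.s3a.secret.key={secret_key}",
    ]


# Placeholder credentials only; real keys should not be hard-coded.
command = spark_shell_command("2.7.3", "<access-key>", "<secret-key>")
print(" ".join(shlex.quote(arg) for arg in command))
```

Building the argument list in one place makes it easier to keep the
hadoop-aws version in step with the Hadoop build when switching between
distributions, which is one of the two problems reported above.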

On Wed, Mar 4, 2020 at 11:02 PM Hariharan <hariharan022@gmail.com> wrote:

> If you're using hadoop 2.7 or below, you may also need to use the
> following hadoop settings:
>
> fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
> fs.s3.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
> fs.AbstractFileSystem.s3.impl=org.apache.hadoop.fs.s3a.S3A
> fs.AbstractFileSystem.s3a.impl=org.apache.hadoop.fs.s3a.S3A
>
> Hadoop 2.8 and above would have these set by default.
>
> Thanks,
> Hariharan
>
> On Thu, Mar 5, 2020 at 2:41 AM Devin Boyer
> <devin.boyer@mapbox.com.invalid> wrote:
> >
> > Hello,
> >
> > I'm attempting to run Spark within a Docker container with the hope of
> > eventually running Spark on Kubernetes. Nearly all the data we currently
> > process with Spark is stored in S3, so I need to be able to interface with
> > it using the S3A filesystem.
> >
> > I feel like I've gotten close to getting this working but for some
> > reason cannot get my local Spark installations to correctly interface with
> > S3 yet.
> >
> > A basic example of what I've tried:
> >
> > 1. Build Kubernetes Docker images by downloading the
> >    spark-2.4.5-bin-hadoop2.7.tgz archive and building the
> >    kubernetes/dockerfiles/spark/Dockerfile image.
> > 2. Run an interactive Docker container using the image built above.
> > 3. Within that container, run spark-shell. This command passes valid AWS
> >    credentials by setting spark.hadoop.fs.s3a.access.key and
> >    spark.hadoop.fs.s3a.secret.key using --conf flags, and downloads the
> >    hadoop-aws package by specifying the --packages
> >    org.apache.hadoop:hadoop-aws:2.7.3 flag.
> > 4. Try to access the simple public file as outlined in the "Integration
> >    with Cloud Infrastructures" documentation by running:
> >    sc.textFile("s3a://landsat-pds/scene_list.gz").take(5)
> > 5. Observe that this fails with a 403 Forbidden exception thrown by S3.
> >
> >
> > I've tried a variety of other means of setting credentials (like
> > exporting the standard AWS_ACCESS_KEY_ID environment variable before
> > launching spark-shell), and other means of building a Spark image and
> > including the appropriate libraries (see this GitHub repo:
> > https://github.com/drboyer/spark-s3a-demo), all with the same results.
> > I've also tried accessing objects within our AWS account, rather than the
> > object from the public landsat-pds bucket, with the same 403 error being
> > thrown.
> >
> > Can anyone help explain why I can't seem to connect to S3 successfully
> > using Spark, or even explain where I could look for additional clues as to
> > what's misconfigured? I've tried turning up the logging verbosity and
> > didn't see much that was particularly useful, but happy to share additional
> > log output too.
> >
> > Thanks for any help you can provide!
> >
> > Best,
> > Devin Boyer
>
>
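
The four Hadoop settings Hariharan lists for Hadoop 2.7 or below can also be
passed on the spark-shell command line. The sketch below renders them as
--conf flags, relying on Spark copying any property prefixed with
spark.hadoop. into the Hadoop configuration. The names S3A_SETTINGS and
to_conf_flags are hypothetical, and whether each class is present in a given
hadoop-aws release is not verified here.

```python
# Settings copied from Hariharan's message (for Hadoop 2.7 or below).
S3A_SETTINGS = {
    "fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
    "fs.s3.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
    "fs.AbstractFileSystem.s3.impl": "org.apache.hadoop.fs.s3a.S3A",
    "fs.AbstractFileSystem.s3a.impl": "org.apache.hadoop.fs.s3a.S3A",
}


def to_conf_flags(settings):
    """Render Hadoop properties as spark-shell --conf arguments."""
    flags = []
    for key, value in settings.items():
        # The spark.hadoop. prefix forwards the property to Hadoop.
        flags += ["--conf", f"spark.hadoop.{key}={value}"]
    return flags


# Print one flag pair per line for pasting into a spark-shell invocation.
flags = to_conf_flags(S3A_SETTINGS)
for option, assignment in zip(flags[::2], flags[1::2]):
    print(option, assignment)
```

According to the same message, Hadoop 2.8 and above set these by default, so
the flags should only matter for older builds.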
