spark-dev mailing list archives

From Steve Loughran <>
Subject Re: Spark 3.0 and S3A
Date Fri, 01 Nov 2019 12:40:53 GMT
On Mon, Oct 28, 2019 at 3:40 PM Sean Owen <> wrote:

> There will be a "Hadoop 3.x" version of 3.0, as it's essential to get
> a JDK 11-compatible build. You can see the hadoop-3.2 profile.
> hadoop-aws is pulled in by the hadoop-cloud module, I believe, so it bears
> checking whether the profile updates the versions there too.

It does: you get hadoop-cloud-storage 3.2, which comes with an
aws-sdk-shaded jar in sync with both the s3a code and spark-kinesis.

Trying to use the Hadoop 2.7 version of the s3a connector is an exercise in
painful futility. It works, but it is, what, four years out of date? As well
as missing all the performance and scale improvements (random IO reads in
particular), it's got an out-of-date AWS SDK with an embedded org.json
module whose licence is now forbidden by the ASF (hence: no more ASF
releases of 2.7.x), and it doesn't really handle any of the new
v4-signature-only S3 regions.

If you ever search for "spark + s3a" you will see that the first step to
talking to S3 with the ASF releases is trying to get your classpath right.
Given that the attempts generally consist of dropping in a new AWS SDK
or hadoop-aws-3.1 JAR, the first question is "why do I get some
class not found exception?"

As we say in the docs: randomly dropping in jars simply moves your stack
trace around.
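
To make the classpath point concrete, here is a minimal sketch of a check
for mixed Hadoop versions among a set of JAR filenames. The function names
and the filename pattern are illustrative assumptions, not part of Spark or
Hadoop. It only looks at hadoop-* artifacts with plain numeric versions and
does not verify that the AWS SDK matches what hadoop-aws was built against.

```python
import re
from collections import defaultdict

# Hypothetical pattern: "hadoop-aws-3.2.1.jar" -> ("hadoop-aws", "3.2.1").
# Suffixed versions such as "3.2.1-SNAPSHOT" are deliberately not handled.
_JAR_RE = re.compile(r"^(hadoop-[a-z0-9-]+?)-(\d+(?:\.\d+)+)\.jar$")


def hadoop_versions(jar_names):
    """Group hadoop-* artifact names by the version in their filename."""
    by_version = defaultdict(list)
    for name in jar_names:
        match = _JAR_RE.match(name)
        if match:
            by_version[match.group(2)].append(match.group(1))
    return dict(by_version)


def is_consistent(jar_names):
    """Return True when at most one Hadoop version is present."""
    return len(hadoop_versions(jar_names)) <= 1


if __name__ == "__main__":
    # A Hadoop 2.7 classpath with a newer hadoop-aws JAR dropped in.
    jars = [
        "hadoop-common-2.7.4.jar",
        "hadoop-aws-3.1.0.jar",
        "aws-java-sdk-1.7.4.jar",
    ]
    print(hadoop_versions(jars))
    print(is_consistent(jars))
```

A classpath that fails this check is the kind that tends to produce the
class not found errors described above; passing it does not prove the
classpath is correct.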

> On Mon, Oct 28, 2019 at 10:34 AM Nicholas Chammas
> <> wrote:
> >
> > Howdy folks,
> >
> > I have a question about what is happening with the 3.0 release in
> > relation to Hadoop and hadoop-aws.
> >
> > Today, among other builds, we release a build of Spark built against
> > Hadoop 2.7 and another one built without Hadoop. In Spark 3+, will we
> > continue to release Hadoop 2.7 builds as one of the primary downloads on
> > the download page? Or will we start building Spark against a newer version
> > of Hadoop?
> >
> > The reason I ask is because successive versions of hadoop-aws have made
> > significant usability improvements to S3A. To get those, users need to
> > download the Hadoop-free build of Spark and then link Spark to a version of
> > Hadoop newer than 2.7. There are various dependency and runtime issues with
> > trying to pair Spark built against Hadoop 2.7 with hadoop-aws 2.8 or newer.
> >
> > If we start releasing builds of Spark built against Hadoop 3.2 (or
> > another recent version), users can get the latest S3A improvements via
> > --packages "org.apache.hadoop:hadoop-aws:3.2.1" without needing to download
> > Hadoop separately.
> >
> > Nick