spark-dev mailing list archives

From Steve Loughran <>
Subject Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?
Date Mon, 04 Nov 2019 15:00:55 GMT
I'd move spark's branch-2 line to 2.9.x because:

(a) spark's version of httpclient hits a bug in the AWS SDK used in
hadoop-2.8 unless you revert that patch
(b) there's only one future version of the 2.8.x line planned, which is
expected once I or someone else sits down to do it. After that, all CVEs will
be dealt with by "upgrade".
(c) it's actually tested against java 8, whereas versions <= 2.8 are
nominally java 7 only.
(d) Microsoft contributed a lot for Azure integration
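
For reference, Spark's Maven build already exposes the Hadoop line as a build
profile, so the choice under discussion is visible at the command line. A
minimal sketch, assuming Spark 3.0-era profile names from its pom.xml
(`hadoop-2.7` / `hadoop-3.2`); the exact `hadoop.version` values are
illustrative, not a recommendation:

```shell
# Build against a 2.9.x release of the Hadoop 2 line
# (the hadoop-2.7 profile covers branch-2; hadoop.version overrides its default):
./build/mvn -Phadoop-2.7 -Dhadoop.version=2.9.2 -DskipTests clean package

# Build against the proposed default, the hadoop-3.2 profile:
./build/mvn -Phadoop-3.2 -DskipTests clean package
```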

To be fair, the fact that the 2.7 release has lasted so long is actually
pretty impressive: core APIs stable, Kerberos under control, HDFS client
and server happy (no erasure coding, other things tho'). The lack of
support/performance for object store integration shows how much things have
changed since its release in April 2015. But that was over four years ago.

On Sat, Nov 2, 2019 at 4:36 PM Koert Kuipers <> wrote:

> i dont see how we can be close to the point where we dont need to support
> hadoop 2.x. this does not agree with the reality from my perspective, which
> is that all our clients are on hadoop 2.x. not a single one is on hadoop
> 3.x currently.

Maybe, but they're unlikely to be on a "vanilla" 2.7.x release, except for
some very special cases where teams have taken on the task of maintaining
their own fork.

> this includes deployments of cloudera distros, hortonworks distros,

In front of me I have a git source tree whose repositories let me see the
version histories of all of these, and ~HD/I too. This is a power (I can
make changes to all) and a responsibility (I could accidentally break the
nightly builds of all if I'm not careful (1)). The one thing I don't have
is write access to asf gitbox, but that's only to stop me accidentally
pushing an internal HDP or CDH branch up to the ASF/github repos (2).

CDH5.x: hadoop branch-2 with some S3A features backported from hadoop
branch-3 (i.e. S3Guard). I'd call it 2.8+, though I don't know it in detail.

HDP2.6.x: again, 2.8+ with abfs and gcs support.

Either way: when Spark 3.x ships it'd be up to Cloudera to deal with that.

I have no idea what is going to happen there. If other people want to test
spark 3.0.0 on those platforms, go for it, but do consider that the
commercial on-premises clusters have had a hadoop-3 option for 2+ years and
that every month the age of those 2.x-based clusters increases. In cloud,
things are transient, so it doesn't matter *at all*.

> and cloud distros like emr and dataproc.
EMR is a closed-source fork of (hadoop, hbase, spark, ...) with their own
S3 connector which has never had its source seen other than in stack traces
on stack overflow. Their problem (3).

HD/I: Current with azure connectivity, doesn't worry about the rest.

dataproc: no idea. Their gcs connector has been pretty stable. They do both
branch-2 and branch-3.1 artifacts & do run the fs contract tests to help
catch regressions in our code and theirs.

For all those in-cloud deployments, if you say "min version is Hadoop 3.x
artifacts" then when they offer spark-3 they'll just do it with their build
of the hadoop-3 JARs. It's not like they have 1000+ node HDFS clusters to
upgrade.

> forcing us to be on older spark versions would be unfortunate for us, and
> also bad for the community (as deployments like ours help find bugs in
> spark).
Bear also in mind: because all the work with hadoop, hive, HBase etc goes
on in branch-3 code, compatibility with those things ages too. If you are
worried about Hive, well, you need to be working with their latest releases
to get any issues you find fixed.

It's a really hard choice here: stable dependencies versus newer ones.
Certainly hadoop stayed with an old version of guava because the upgrade
was so traumatic (it's changed now), and as for protobuf, that was so
traumatic that everyone left it alone: it stayed frozen until last month
(3.3, not 3.2.x, and protoc is done in java/maven). At the same time, CVEs
force Jackson updates on a fortnightly basis, and the move to java 11
breaks so much that it's a big upgrade festival for us all.

You're going to have to consider "how much suffering with Hadoop 2.7
support is justified?" and "what should be the version which is actually
shipped for people to play with?". I think my stance is clear: time to move
on. You cut your test matrix in half, can be confident that all users
reporting bugs will be on hadoop 3.x, and when you do file bugs with your
peer ASF projects they won't get closed as WONTFIX.

BTW: out of curiosity, what versions of things does Databricks build off:
ASF 2.7.x or something later?


(1) Narrator: He has accidentally broken the nightly builds of most of
these. And IBM websphere once. Breaking google cloud is still an unrealised
ambition.
(2) Narrator: He has accidentally pushed up a release of an internal branch
to the ASF/github repos. Colleagues were unhappy.
(3) Pro: they don't have to worry about me directly breaking their S3
integration. Con: I could still indirectly do it elsewhere in the source
tree, wouldn't notice, and probably wouldn't care much if they complained.
