I'd move Spark's branch-2 line to Hadoop 2.9.x, because:
(a) Spark's version of httpclient hits a bug in the AWS SDK used in hadoop-2.8 unless you revert that patch: https://issues.apache.org/jira/browse/SPARK-22919
(b) there's only one more 2.8.x release planned, expected once I or someone else sits down to do it. After that, all CVEs will be dealt with by "upgrade".
(c) 2.9 is actually tested against Java 8, whereas versions <= 2.8 are nominally Java 7 only.
(d) Microsoft contributed a lot of the Azure integration.
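On (a): until that patch is reverted or the SDK moves, the usual workaround is to pin httpclient in dependencyManagement so the AWS SDK pulled in via hadoop-aws 2.8.x sees a version it can cope with. A minimal sketch only; the version number here is illustrative, check SPARK-22919 for the real story:

```xml
<!-- Hedged sketch: pin httpclient so the hadoop-aws / AWS SDK pairing
     in Hadoop 2.8.x doesn't pick up an incompatible version.
     The version below is illustrative; see SPARK-22919 for specifics. -->
<dependencyManagement>
  <dependencies>
    <dependency>
      <groupId>org.apache.httpcomponents</groupId>
      <artifactId>httpclient</artifactId>
      <version>4.5.x</version><!-- illustrative placeholder -->
    </dependency>
  </dependencies>
</dependencyManagement>
```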
To be fair, the fact that the 2.7 release line has lasted so long is actually pretty impressive. Core APIs stable; Kerberos under control; HDFS client and server happy (no erasure coding and other things, though); the lack of support/performance for object store integration shows how much things have changed since its release in April 2015. But that was over five years ago.
> I don't see how we can be close to the point where we don't need to support Hadoop 2.x. This does not match reality from my perspective: all our clients are on Hadoop 2.x; not a single one is on Hadoop 3.x currently.
Maybe, but they're unlikely to be on a "vanilla" 2.7.x release, except for some very special cases where teams have taken on the task of maintaining their own installation.
> this includes deployments of Cloudera distros, Hortonworks distros,
In front of me I have a git source tree whose repositories let me see the version histories of all of these, and ~HD/I too. This is a power (I can make changes to all of them) and a responsibility (I could accidentally break the nightly builds of all of them if I'm not careful (1)). The one thing it doesn't have is write access to ASF gitbox, but that's only to stop me accidentally pushing an internal HDP or CDH branch up to the ASF/github repos (2).
CDH 5.x: Hadoop branch-2 with some S3A features backported from branch-3 (i.e. S3Guard). I'd call it 2.8+, though I don't know it in detail.
HDP 2.6.x: again, 2.8+, with ABFS and GCS support.
Either way: when Spark 3.x ships it'd be up to Cloudera to deal with that release.
I have no idea what is going to happen there. If other people want to test Spark 3.0.0 on those platforms, go for it, but do consider that the commercial on-premises clusters have had a Hadoop 3 option for 2+ years, and that every month the age of those 2.x-based clusters increases. In cloud, things are transient, so it doesn't matter *at all*.
> and cloud distros like EMR and Dataproc.
EMR is a closed-source fork of (Hadoop, HBase, Spark, ...) with its own S3 connector, whose source has never been seen other than in stack traces on Stack Overflow. Their problem (3).
HD/I: current, with Azure connectivity; doesn't worry about the rest.
Dataproc: no idea. Their GCS connector has been pretty stable. They ship both branch-2 and branch-3.1 artifacts, and do run the FS contract tests to help catch regressions in our code and theirs.
For all those in-cloud deployments, if you say "minimum version is the Hadoop 3.x artifacts", then when they offer Spark 3 they'll just do it with their build of the Hadoop 3 JARs. It's not like they have 1000+ node HDFS clusters to upgrade.
> Forcing us to be on older Spark versions would be unfortunate for us, and also bad for the community (as deployments like ours help find bugs in Spark).
Bear in mind, also: because all the work on Hadoop, Hive, HBase etc. goes on in branch-3 code, compatibility with those things ages too. If you are worried about Hive, you need to be working with their latest releases to get any issues you find fixed.
It's a really hard choice here: stable dependencies versus newer ones. Certainly Hadoop stayed with an old version of Guava because the upgrade was so traumatic (it's changed now), and as for protobuf, that was so traumatic that everyone left it frozen until last month (3.3, not 3.2.x, and protoc is done in java/maven). At the same time, CVEs force Jackson updates on a fortnightly basis, and the move to Java 11 breaks so much that it's a big upgrade festival for us all.
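For what it's worth, the fix Hadoop eventually settled on for Guava was shading: relocate the library into a private namespace so downstream classpaths never see it. A sketch of the kind of maven-shade-plugin relocation involved; the target package mirrors Hadoop's thirdparty pattern, but treat the details as illustrative rather than the actual build config:

```xml
<!-- Hedged sketch of the shading approach: relocate Guava into a
     private namespace so downstream apps can run whatever Guava
     version they like. The shadedPattern mirrors Hadoop's
     hadoop-thirdparty convention; coordinates are illustrative. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <executions>
    <execution>
      <phase>package</phase>
      <goals><goal>shade</goal></goals>
      <configuration>
        <relocations>
          <relocation>
            <pattern>com.google.common</pattern>
            <shadedPattern>org.apache.hadoop.thirdparty.com.google.common</shadedPattern>
          </relocation>
        </relocations>
      </configuration>
    </execution>
  </executions>
</plugin>
```

The cost is that every stack trace and IDE jump lands in the relocated package, which is part of why nobody does this until the pain of not doing it is worse.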
You're going to have to consider "how much suffering is Hadoop 2.7 support worth?" and "which version should actually be shipped for people to play with?". I think my stance is clear: time to move on. You cut your test matrix in half, can be confident that all users reporting bugs will be on Hadoop 3.x, and when you file bugs with your peer ASF projects they won't get closed as WONTFIX.
BTW, out of curiosity: what versions of things does Databricks build off? ASF 2.7.x, or something later?
(1) Narrator: he has accidentally broken the nightly builds of most of these. And IBM WebSphere, once. Breaking Google Cloud is still an unrealised ambition.
(2) Narrator: He has accidentally pushed up a release of an internal branch to the ASF/github repos. Colleagues were unhappy.
(3) Pro: they don't have to worry about me directly breaking their S3 integration. Con: I could still indirectly do it elsewhere in the source tree, wouldn't notice, and probably wouldn't care much if they complained.