From Steve Loughran
Subject Re: time for Apache Spark 3.0?
Date Thu, 05 Apr 2018 17:44:49 GMT

On 5 Apr 2018, at 18:04, Matei Zaharia wrote:

Java 9/10 support would be great to add as well.

Be aware that moving Hadoop core to Java 9+ is still a big piece of work, being undertaken
by Akira Ajisaka & colleagues at NTT.

Big dependency updates, and handling Oracle hiding the sun.misc stuff which low-level code
depends on, are the trouble spots, with a move to Log4J 2 going to be observably traumatic to all apps
which require a to set themselves up. As usual: any testing which can be done
early will be welcomed by all, the earlier the better.
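
For anyone wanting to start that testing, here is a minimal sketch of a probe that reports which JDK-internal classes are still loadable on whatever JVM runs it. The class name `InternalApiProbe` and the three candidate class names are my own picks for illustration, not a list taken from Hadoop or Spark; substitute whatever your low-level code actually references.

```java
public class InternalApiProbe {

    // Returns true if the named class can be loaded (without initializing it).
    static boolean isLoadable(String className) {
        try {
            Class.forName(className, false, InternalApiProbe.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("Java spec version: "
                + System.getProperty("java.specification.version"));

        // Illustrative candidates only (an assumption, not from this thread).
        String[] candidates = {
            "sun.misc.Unsafe",
            "sun.misc.Cleaner",
            "sun.misc.BASE64Encoder"
        };
        for (String name : candidates) {
            System.out.println(name + " loadable: " + isLoadable(name));
        }
    }
}
```

Running the same class under Java 8 and under a Java 9+ JVM and comparing the output is a cheap first check. It only tells you whether a class is visible, not whether reflective access to its members still works.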

That stuff is all about getting things working: supporting the Java 9 packaging model, which
is a really compelling reason to go for it.

Regarding Scala 2.12, I thought that supporting it would become easier if we change the Spark
API and ABI slightly. Basically, it is of course possible to create an alternate source tree
today, but it might be possible to share the same source files if we tweak some small things
in the methods that are overloaded across Scala and Java. I don’t remember the exact details,
but the idea was to reduce the total maintenance work needed at the cost of requiring users
to recompile their apps.

I’m personally for moving to 3.0 because of the other things we can clean up as well, e.g.
the default SQL dialect, Iterable stuff, and possibly dependency shading (a major pain point
for lots of users).

Hadoop 3 does have a shaded client, though not enough for Spark; if work identifying &
fixing the outstanding dependencies is started now, Hadoop 3.2 should be able to offer the
set of shaded libraries needed by Spark.

There's always a price to that, which is in redistributable size and its impact on start
times, duplicate classes loaded (memory, reduced chance of JIT recompilation, ...), and the
whole transitive-shading problem. Java 9 should be the real target for a clean solution to
all of this.
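
When chasing duplicate or conflicting classes, it helps to see which JAR a class was actually loaded from. A minimal sketch, assuming nothing beyond the JDK; the class name `ClassOrigin` and its `originOf` helper are hypothetical names for illustration, not part of Hadoop or Spark.

```java
import java.security.CodeSource;

public class ClassOrigin {

    // Reports where a class was loaded from, e.g. a JAR or directory URL.
    static String originOf(Class<?> cls) {
        CodeSource source = cls.getProtectionDomain().getCodeSource();
        if (source == null || source.getLocation() == null) {
            // Classes loaded by the bootstrap loader usually have no code source.
            return "(no code source; typically a JDK class)";
        }
        return source.getLocation().toString();
    }

    public static void main(String[] args) throws ClassNotFoundException {
        // Pass a fully qualified class name; defaults to a JDK class.
        String name = args.length > 0 ? args[0] : "java.lang.String";
        System.out.println(name + " <- " + originOf(Class.forName(name)));
    }
}
```

Run it with the application's full classpath and the name of a class you suspect is duplicated; comparing the reported location for a shaded (relocated) name and the unshaded one shows which copy each name resolves to.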
