Hi, Steve, 

Thanks for your comments! My main quality concern is not with Hadoop 3.2 itself. In this release, the Hive execution module upgrade (from 1.2 to 2.3), the Hive thrift-server upgrade, and JDK 11 support are added to the Hadoop 3.2 profile only. Compared with the Hadoop 2.x profile, the Hadoop 3.2 profile is riskier because of these changes.

To speed up the adoption of Spark 3.0, which has many other highly desirable features, I propose keeping the Hadoop 2.x profile as the default.



On Fri, Nov 1, 2019 at 5:33 AM Steve Loughran <stevel@cloudera.com> wrote:
What is the current default value? The 2.x releases are becoming EOL: 2.7 is dead, there might be a 2.8.x, and for now 2.9 is the branch-2 release getting attention. 2.10.0 shipped yesterday, but the ".0" means there will inevitably be surprises.

One issue with using older versions is that any problem reported, especially with stack traces you can blame me for, will generally be met by a response of "does it go away when you upgrade?" The other issue is how much test coverage things are getting.

W.r.t. Hadoop 3.2 stability, nothing major has been reported. The ABFS client is there, and the big Guava update (HADOOP-16213) went in. People will either love or hate that.

No major changes in the s3a code between 3.2.0 and 3.2.1. I have a large backport planned though, including changes to better handle AWS caching of 404s generated from HEAD requests before an object was actually created.

It would be really good if the Spark distributions shipped with later versions of the Hadoop artifacts.

On Mon, Oct 28, 2019 at 7:53 PM Xiao Li <lixiao@databricks.com> wrote:
The stability and quality of the Hadoop 3.2 profile are unknown. The changes are massive, including Hive execution and a new version of the Hive thriftserver.

To reduce the risk, I would like to keep the current default version unchanged. When it becomes stable, we can change the default profile to Hadoop-3.2. 



On Mon, Oct 28, 2019 at 12:51 PM Sean Owen <srowen@gmail.com> wrote:
I'm OK with that, but don't have a strong opinion nor info about the
That said my guess is we're close to the point where we don't need to
support Hadoop 2.x anyway, so, yeah.

On Mon, Oct 28, 2019 at 2:33 PM Dongjoon Hyun <dongjoon.hyun@gmail.com> wrote:
> Hi, All.
> There was a discussion on publishing artifacts built with Hadoop 3.
> But, we are still publishing with Hadoop 2.7.3 and `3.0-preview` will be the same because we didn't change anything yet.
> Technically, we need to change two places for publishing.
> 1. Jenkins Snapshot Publishing
>     https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-maven-snapshots/
> 2. Release Snapshot/Release Publishing
>     https://github.com/apache/spark/blob/master/dev/create-release/release-build.sh
> To minimize the change, we need to switch our default Hadoop profile.
> Currently, the default is `hadoop-2.7 (2.7.4)` profile and `hadoop-3.2 (3.2.0)` is optional.
> We had better use `hadoop-3.2` profile by default and `hadoop-2.7` optionally.
> Note that this means we use Hive 2.3.6 by default. Only `hadoop-2.7` distribution will use `Hive 1.2.1` like Apache Spark 2.4.x.
> Bests,
> Dongjoon.

