Hi all,

I went over all the finished JIRA tickets targeted at Spark 3.0.0. Below are the notable features and major changes that are ready to test/deliver. Please don't hesitate to add more to the list:

SPARK-11215 Multiple columns support added to various Transformers: StringIndexer

SPARK-11150 Implement Dynamic Partition Pruning

SPARK-13677 Support Tree-Based Feature Transformation

SPARK-16692 Add MultilabelClassificationEvaluator

SPARK-19591 Add sample weights to decision trees

SPARK-19712 Pushing Left Semi and Left Anti joins through Project, Aggregate, Window, Union etc.

SPARK-19827 R API for Power Iteration Clustering

SPARK-20286 Improve logic for timing out executors in dynamic allocation

SPARK-20636 Eliminate unnecessary shuffle with adjacent Window expressions

SPARK-22148 Acquire new executors to avoid hang because of blacklisting

SPARK-22796 Multiple columns support added to various Transformers: PySpark QuantileDiscretizer

SPARK-23128 A new approach to do adaptive execution in Spark SQL

SPARK-23674 Add Spark ML Listener for Tracking ML Pipeline Status

SPARK-23710 Upgrade the built-in Hive to 2.3.5 for hadoop-3.2

SPARK-24333 Add fit with validation set to Gradient Boosted Trees: Python API

SPARK-24417 Build and Run Spark on JDK11

SPARK-24615 Accelerator-aware task scheduling for Spark

SPARK-24920 Allow sharing Netty's memory pool allocators

SPARK-25250 Fix race condition with tasks running when new attempt for same stage is created leads to other task in the next attempt running on the same partition id retry multiple times

SPARK-25341 Support rolling back a shuffle map stage and re-generate the shuffle files

SPARK-25348 Data source for binary files

SPARK-25603 Generalize Nested Column Pruning

SPARK-26132 Remove support for Scala 2.11 in Spark 3.0.0

SPARK-26215 define reserved keywords after SQL standard

SPARK-26412 Allow Pandas UDF to take an iterator of pd.DataFrames

SPARK-26785 data source v2 API refactor: streaming write

SPARK-26956 remove streaming output mode from data source v2 APIs

SPARK-27064 create StreamingWrite at the beginning of streaming execution

SPARK-27119 Do not infer schema when reading Hive serde table with native data source

SPARK-27225 Implement join strategy hints

SPARK-27240 Use pandas DataFrame for struct type argument in Scalar Pandas UDF

SPARK-27338 Fix deadlock between TaskMemoryManager and UnsafeExternalSorter$SpillableIterator

SPARK-27396 Public APIs for extended Columnar Processing Support

SPARK-27589 Re-implement file sources with data source V2 API

SPARK-27677 Disk-persisted RDD blocks served by shuffle service, and ignored for Dynamic Allocation

SPARK-27699 Partially push down disjunctive predicates in Parquet/ORC

SPARK-27763 Port test cases from PostgreSQL to Spark SQL (ongoing)

SPARK-27884 Deprecate Python 2 support

SPARK-27921 Convert applicable *.sql tests into UDF integrated test base

SPARK-27963 Allow dynamic allocation without an external shuffle service

SPARK-28177 Adjust post shuffle partition number in adaptive execution

SPARK-28372 Document Spark WEB UI

SPARK-28399 RobustScaler feature transformer

SPARK-28426 Metadata Handling in Thrift Server

SPARK-28588 Build a SQL reference doc (ongoing)

SPARK-28608 Improve test coverage of ThriftServer

SPARK-28753 Dynamically reuse subqueries in AQE

SPARK-28855 Remove outdated Experimental, Evolving annotations

SPARK-25908, SPARK-28980 Remove deprecated items since <= 2.2.0


Cheers,

Xingbo