spark-dev mailing list archives

From Xingbo Jiang
Subject Spark 3.0 preview release feature list and major changes
Date Mon, 07 Oct 2019 22:02:32 GMT
Hi all,

I went over all the finished JIRA tickets targeted to Spark 3.0.0. Below
are the notable features and major changes that are ready to test/deliver.
Please don't hesitate to add more to the list:

SPARK-11215 Multiple columns support added to various Transformers:
StringIndexer

SPARK-11150 Implement Dynamic Partition Pruning

SPARK-13677 Support Tree-Based Feature Transformation

SPARK-16692 Add

SPARK-19591 Add sample weights to decision trees

SPARK-19712 Pushing Left Semi and Left Anti joins through Project,
Aggregate, Window, Union etc.

SPARK-19827 R API for Power Iteration Clustering

SPARK-20286 Improve logic for timing out executors in dynamic allocation

SPARK-20636 Eliminate unnecessary shuffle with adjacent Window expressions

SPARK-22148 Acquire new executors to avoid hang because of blacklisting

SPARK-22796 Multiple columns support added to various Transformers:
PySpark QuantileDiscretizer

SPARK-23128 A new approach to do adaptive execution in Spark SQL

SPARK-23674 Add Spark ML Listener for Tracking ML Pipeline Status

SPARK-23710 Upgrade the built-in Hive to 2.3.5 for hadoop-3.2

SPARK-24333 Add fit with validation set to Gradient Boosted Trees:
Python API

SPARK-24417 Build and Run Spark on JDK11

SPARK-24615 Accelerator-aware task scheduling for Spark

SPARK-24920 Allow sharing Netty's memory pool allocators

SPARK-25250 Fix race condition with tasks running when new attempt for
same stage is created leads to other task in the next attempt running on
the same partition id retry multiple times

SPARK-25341 Support rolling back a shuffle map stage and re-generate the
shuffle files

SPARK-25348 Data source for binary files

SPARK-25603 Generalize Nested Column Pruning

SPARK-26132 Remove support for Scala 2.11 in Spark 3.0.0

SPARK-26215 define reserved keywords after SQL standard

SPARK-26412 Allow Pandas UDF to take an iterator of pd.DataFrames

SPARK-26785 data source v2 API refactor: streaming write

SPARK-26956 remove streaming output mode from data source v2 APIs

SPARK-27064 create StreamingWrite at the beginning of streaming execution

SPARK-27119 Do not infer schema when reading Hive serde table with native
data source

SPARK-27225 Implement join strategy hints

SPARK-27240 Use pandas DataFrame for struct type argument in Scalar
Pandas UDF

SPARK-27338 Fix deadlock between TaskMemoryManager and

SPARK-27396 Public APIs for extended Columnar Processing Support

SPARK-27589 Re-implement file sources with data source V2 API

SPARK-27677 Disk-persisted RDD blocks served by shuffle service, and
ignored for Dynamic Allocation

SPARK-27699 Partially push down disjunctive predicates in Parquet/ORC

SPARK-27763 Port test cases from PostgreSQL to Spark SQL (ongoing)

SPARK-27884 Deprecate Python 2 support

SPARK-27921 Convert applicable *.sql tests into UDF integrated test base

SPARK-27963 Allow dynamic allocation without an external shuffle service

SPARK-28177 Adjust post shuffle partition number in adaptive execution

SPARK-28372 Document Spark WEB UI

SPARK-28399 RobustScaler feature transformer

SPARK-28426 Metadata Handling in Thrift Server

SPARK-28588 Build a SQL reference doc (ongoing)

SPARK-28608 Improve test coverage of ThriftServer

SPARK-28753 Dynamically reuse subqueries in AQE

SPARK-28855 Remove outdated Experimental, Evolving annotations

SPARK-25908, SPARK-28980 Remove deprecated items since <= 2.2.0


