spark-dev mailing list archives

From Jungtaek Lim <kabhwan.opensou...@gmail.com>
Subject Re: Spark 3.0 preview release feature list and major changes
Date Mon, 07 Oct 2019 22:50:28 GMT
Thanks for putting together this nice summary of Spark 3.0 improvements!

I'd like to add some items from the Structured Streaming side:

SPARK-28199 <https://issues.apache.org/jira/browse/SPARK-28199> Move
Trigger implementations to Triggers.scala and avoid exposing these to the
end users (removal of deprecated APIs)
SPARK-23539 <https://issues.apache.org/jira/browse/SPARK-23539> Add support
for Kafka headers in Structured Streaming
SPARK-25501 <https://issues.apache.org/jira/browse/SPARK-25501> Add kafka
delegation token support (follow-up issues added further functionality,
such as support for multiple clusters)
SPARK-26848 <https://issues.apache.org/jira/browse/SPARK-26848> Introduce
new option to Kafka source: offset by timestamp (starting/ending)
SPARK-28074 <https://issues.apache.org/jira/browse/SPARK-28074> Log warn
message on possible correctness issue for multiple stateful operations in
single query

and from the core side:

SPARK-23155 <https://issues.apache.org/jira/browse/SPARK-23155> New
feature: apply custom log URL pattern for executor log URLs in SHS
(a follow-up issue extended this to the Spark UI as well)

FYI, if we also count work in progress, there's an ongoing umbrella issue
on rolling event logs and snapshots (SPARK-28594
<https://issues.apache.org/jira/browse/SPARK-28594>), which we are working
hard to get done for Spark 3.0.

Thanks,
Jungtaek Lim (HeartSaVioR)


On Tue, Oct 8, 2019 at 7:02 AM Xingbo Jiang <jiangxb1987@gmail.com> wrote:

> Hi all,
>
> I went over all the finished JIRA tickets targeted to Spark 3.0.0. Here
> I'm listing all the notable features and major changes that are ready to
> test/deliver. Please don't hesitate to add more to the list:
>
> SPARK-11215 <https://issues.apache.org/jira/browse/SPARK-11215> Multiple
> columns support added to various Transformers: StringIndexer
>
> SPARK-11150 <https://issues.apache.org/jira/browse/SPARK-11150> Implement
> Dynamic Partition Pruning
>
> SPARK-13677 <https://issues.apache.org/jira/browse/SPARK-13677> Support
> Tree-Based Feature Transformation
>
> SPARK-16692 <https://issues.apache.org/jira/browse/SPARK-16692> Add
> MultilabelClassificationEvaluator
>
> SPARK-19591 <https://issues.apache.org/jira/browse/SPARK-19591> Add
> sample weights to decision trees
>
> SPARK-19712 <https://issues.apache.org/jira/browse/SPARK-19712> Pushing
> Left Semi and Left Anti joins through Project, Aggregate, Window, Union etc.
>
> SPARK-19827 <https://issues.apache.org/jira/browse/SPARK-19827> R API for
> Power Iteration Clustering
>
> SPARK-20286 <https://issues.apache.org/jira/browse/SPARK-20286> Improve
> logic for timing out executors in dynamic allocation
>
> SPARK-20636 <https://issues.apache.org/jira/browse/SPARK-20636> Eliminate
> unnecessary shuffle with adjacent Window expressions
>
> SPARK-22148 <https://issues.apache.org/jira/browse/SPARK-22148> Acquire
> new executors to avoid hang because of blacklisting
>
> SPARK-22796 <https://issues.apache.org/jira/browse/SPARK-22796> Multiple
> columns support added to various Transformers: PySpark QuantileDiscretizer
>
> SPARK-23128 <https://issues.apache.org/jira/browse/SPARK-23128> A new
> approach to do adaptive execution in Spark SQL
>
> SPARK-23674 <https://issues.apache.org/jira/browse/SPARK-23674> Add Spark
> ML Listener for Tracking ML Pipeline Status
>
> SPARK-23710 <https://issues.apache.org/jira/browse/SPARK-23710> Upgrade
> the built-in Hive to 2.3.5 for hadoop-3.2
>
> SPARK-24333 <https://issues.apache.org/jira/browse/SPARK-24333> Add fit
> with validation set to Gradient Boosted Trees: Python API
>
> SPARK-24417 <https://issues.apache.org/jira/browse/SPARK-24417> Build and
> Run Spark on JDK11
>
> SPARK-24615 <https://issues.apache.org/jira/browse/SPARK-24615>
> Accelerator-aware task scheduling for Spark
>
> SPARK-24920 <https://issues.apache.org/jira/browse/SPARK-24920> Allow
> sharing Netty's memory pool allocators
>
> SPARK-25250 <https://issues.apache.org/jira/browse/SPARK-25250> Fix race
> condition with tasks running when new attempt for same stage is created
> leads to other task in the next attempt running on the same partition id
> retry multiple times
>
> SPARK-25341 <https://issues.apache.org/jira/browse/SPARK-25341> Support
> rolling back a shuffle map stage and re-generate the shuffle files
>
> SPARK-25348 <https://issues.apache.org/jira/browse/SPARK-25348> Data
> source for binary files
>
> SPARK-25603 <https://issues.apache.org/jira/browse/SPARK-25603>
> Generalize Nested Column Pruning
>
> SPARK-26132 <https://issues.apache.org/jira/browse/SPARK-26132> Remove
> support for Scala 2.11 in Spark 3.0.0
>
> SPARK-26215 <https://issues.apache.org/jira/browse/SPARK-26215> define
> reserved keywords after SQL standard
>
> SPARK-26412 <https://issues.apache.org/jira/browse/SPARK-26412> Allow
> Pandas UDF to take an iterator of pd.DataFrames
>
> SPARK-26785 <https://issues.apache.org/jira/browse/SPARK-26785> data
> source v2 API refactor: streaming write
>
> SPARK-26956 <https://issues.apache.org/jira/browse/SPARK-26956> remove
> streaming output mode from data source v2 APIs
>
> SPARK-27064 <https://issues.apache.org/jira/browse/SPARK-27064> create
> StreamingWrite at the beginning of streaming execution
>
> SPARK-27119 <https://issues.apache.org/jira/browse/SPARK-27119> Do not
> infer schema when reading Hive serde table with native data source
>
> SPARK-27225 <https://issues.apache.org/jira/browse/SPARK-27225> Implement
> join strategy hints
>
> SPARK-27240 <https://issues.apache.org/jira/browse/SPARK-27240> Use
> pandas DataFrame for struct type argument in Scalar Pandas UDF
>
> SPARK-27338 <https://issues.apache.org/jira/browse/SPARK-27338> Fix
> deadlock between TaskMemoryManager and
> UnsafeExternalSorter$SpillableIterator
>
> SPARK-27396 <https://issues.apache.org/jira/browse/SPARK-27396> Public
> APIs for extended Columnar Processing Support
>
> SPARK-27589 <https://issues.apache.org/jira/browse/SPARK-27589>
> Re-implement file sources with data source V2 API
>
> SPARK-27677 <https://issues.apache.org/jira/browse/SPARK-27677>
> Disk-persisted RDD blocks served by shuffle service, and ignored for
> Dynamic Allocation
>
> SPARK-27699 <https://issues.apache.org/jira/browse/SPARK-27699> Partially
> push down disjunctive predicated in Parquet/ORC
>
> SPARK-27763 <https://issues.apache.org/jira/browse/SPARK-27763> Port test
> cases from PostgreSQL to Spark SQL (ongoing)
>
> SPARK-27884 <https://issues.apache.org/jira/browse/SPARK-27884> Deprecate
> Python 2 support
>
> SPARK-27921 <https://issues.apache.org/jira/browse/SPARK-27921> Convert
> applicable *.sql tests into UDF integrated test base
>
> SPARK-27963 <https://issues.apache.org/jira/browse/SPARK-27963> Allow
> dynamic allocation without an external shuffle service
>
> SPARK-28177 <https://issues.apache.org/jira/browse/SPARK-28177> Adjust
> post shuffle partition number in adaptive execution
>
> SPARK-28372 <https://issues.apache.org/jira/browse/SPARK-28372> Document
> Spark WEB UI
>
> SPARK-28399 <https://issues.apache.org/jira/browse/SPARK-28399>
> RobustScaler feature transformer
>
> SPARK-28426 <https://issues.apache.org/jira/browse/SPARK-28426> Metadata
> Handling in Thrift Server
>
> SPARK-28588 <https://issues.apache.org/jira/browse/SPARK-28588> Build a
> SQL reference doc (ongoing)
>
> SPARK-28608 <https://issues.apache.org/jira/browse/SPARK-28608> Improve
> test coverage of ThriftServer
>
> SPARK-28753 <https://issues.apache.org/jira/browse/SPARK-28753>
> Dynamically reuse subqueries in AQE
>
> SPARK-28855 <https://issues.apache.org/jira/browse/SPARK-28855> Remove
> outdated Experimental, Evolving annotations
>
> SPARK-25908 <https://issues.apache.org/jira/browse/SPARK-25908>
> SPARK-28980 <https://issues.apache.org/jira/browse/SPARK-28980> Remove
> deprecated items since <= 2.2.0
>
> Cheers,
>
> Xingbo
>
