spark-dev mailing list archives

From Hyukjin Kwon <gurwls...@gmail.com>
Subject Re: Apache Spark 3.2 Expectation
Date Sat, 27 Feb 2021 00:00:57 GMT
I have an idea that I'll send an email to discuss next week or the week
after. I did not have enough bandwidth to drive both at the same time.
I would appreciate it if we had some more time for 3.2.

In addition, it would be great if we followed the schedule and caught
potential blockers quickly during QA instead of when we cut RCs. That would
considerably speed up the process and keep the release on time.

Thanks.


On Sat, 27 Feb 2021, 06:00 Dongjoon Hyun, <dongjoon.hyun@gmail.com> wrote:

> Thank you for sharing your plan, Huaxin!
>
> Bests,
> Dongjoon.
>
>
> On Fri, Feb 26, 2021 at 12:20 PM huaxin gao <huaxin.gao11@gmail.com>
> wrote:
>
>> Thanks Dongjoon and Xiao for the discussion. I would like to add Data
>> Source V2 Aggregate push down to the list. I am currently working on
>> JDBC Data Source V2 Aggregate push down, but the common code can be used
>> for the file-based V2 Data Source as well. For example, MAX and MIN can be
>> pushed down to Parquet and ORC, since they can use statistics information
>> to perform these operations efficiently. Quite a few users are
>> interested in this Aggregate push down feature, and the preliminary
>> performance test for JDBC Aggregate push down is positive. So I think it is
>> a valuable feature to add for Spark 3.2.
>>
>> Thanks,
>> Huaxin
>>
>> On Fri, Feb 26, 2021 at 11:13 AM Xiao Li <gatorsmile@gmail.com> wrote:
>>
>>> Thank you, Dongjoon, for initiating this discussion. Let us keep it
>>> open. It might take 1-2 weeks to collect from the community all the
>>> features we plan to build and ship in 3.2 since we just finished the 3.1
>>> voting.
>>>
>>>
>>>> 3. +100 for Apache Spark 3.2.0 in July 2021. Maybe, we need
>>>> `branch-cut` in April because we took 3 months for Spark 3.1 release.
>>>
>>>
>>> TBH, cutting the branch this April does not look good to me. That means,
>>> we only have one month left for feature development of Spark 3.2. Do we
>>> have enough features in the current master branch? If not, are we able to
>>> finish major features we collected here? Do they have a timeline or project
>>> plan?
>>>
>>> Xiao
>>>
>>> On Fri, Feb 26, 2021 at 10:07 AM Dongjoon Hyun <dongjoon.hyun@gmail.com> wrote:
>>>
>>>> Thank you, Mridul and Sean.
>>>>
>>>> 1. Yes, `2017` was a typo. Java 17 is scheduled for September 2021. And, of
>>>> course, it's a nice-to-have status. :)
>>>>
>>>> 2. `Push based shuffle and disaggregated shuffle`. Definitely. Thanks
>>>> for sharing.
>>>>
>>>> 3. +100 for Apache Spark 3.2.0 in July 2021. Maybe, we need
>>>> `branch-cut` in April because we took 3 months for Spark 3.1 release.
>>>>     Let's update our release roadmap on the Apache Spark website.
>>>>
>>>> > I'd roughly expect 3.2 in, say, July of this year, given the usual
>>>> > cadence. No reason it couldn't be a little sooner or later. There is
>>>> > already some good stuff in 3.2 and will be a good minor release in 5-6
>>>> > months.
>>>>
>>>> Bests,
>>>> Dongjoon.
>>>>
>>>>
>>>>
>>>> On Thu, Feb 25, 2021 at 9:33 AM Sean Owen <srowen@gmail.com> wrote:
>>>>
>>>>> I'd roughly expect 3.2 in, say, July of this year, given the usual
>>>>> cadence. No reason it couldn't be a little sooner or later. There is
>>>>> already some good stuff in 3.2 and will be a good minor release in 5-6
>>>>> months.
>>>>>
>>>>> On Thu, Feb 25, 2021 at 10:57 AM Dongjoon Hyun <
>>>>> dongjoon.hyun@gmail.com> wrote:
>>>>>
>>>>>> Hi, All.
>>>>>>
>>>>>> Since we have been preparing Apache Spark 3.2.0 in master branch
>>>>>> since December 2020, March seems to be a good time to share our thoughts
>>>>>> and aspirations on Apache Spark 3.2.
>>>>>>
>>>>>> According to the progress on Apache Spark 3.1 release, Apache Spark
>>>>>> 3.2 seems to be the last minor release of this year. Given the timeframe,
>>>>>> we might consider the following. (This is a small set. Please add your
>>>>>> thoughts to this limited list.)
>>>>>>
>>>>>> # Languages
>>>>>>
>>>>>> - Scala 2.13 Support: This was expected on 3.1 via SPARK-25075 but
>>>>>> slipped out. Currently, we are trying to use Scala 2.13.5 via SPARK-34505
>>>>>> and investigating the publishing issue. Thank you for your contributions
>>>>>> and feedback on this.
>>>>>>
>>>>>> - Java 17 LTS Support: Java 17 LTS will arrive in September 2017.
>>>>>> Like Java 11, we need lots of support from our dependencies. Let's see.
>>>>>>
>>>>>> - Python 3.6 Deprecation(?): Python 3.6 community support ends at
>>>>>> 2021-12-23. So, the deprecation is not required yet, but we had better
>>>>>> prepare it because we don't have an ETA of Apache Spark 3.3 in 2022.
>>>>>>
>>>>>> - SparkR CRAN publishing: As we know, it has been discontinued so far.
>>>>>> Resuming it depends on the success of Apache SparkR 3.1.1 CRAN publishing.
>>>>>> If that succeeds, we can keep publishing. Otherwise, I believe we had
>>>>>> better drop it from the release work item list officially.
>>>>>>
>>>>>> # Dependencies
>>>>>>
>>>>>> - Apache Hadoop 3.3.2: Hadoop 3.2.0 becomes the default Hadoop
>>>>>> profile in Apache Spark 3.1. Currently, the Spark master branch lives on
>>>>>> Hadoop 3.2.2's shaded clients via SPARK-33212. So far, there is one ongoing
>>>>>> report in the YARN environment. We hope it will be fixed soon in the Spark
>>>>>> 3.2 timeframe so that we can move toward Hadoop 3.3.2.
>>>>>>
>>>>>> - Apache Hive 2.3.9: Spark 3.0 started to use Hive 2.3.7 by default
>>>>>> instead of the old Hive 1.2 fork. Spark 3.1 removed the hive-1.2 profile
>>>>>> completely via SPARK-32981 and replaced the generated hive-service-rpc
>>>>>> code with the official dependency via SPARK-32981. We are steadily
>>>>>> improving this area and will consume Hive 2.3.9 if available.
>>>>>>
>>>>>> - K8s Client 4.13.2: During K8s GA activity, Spark 3.1 upgrades K8s
>>>>>> client dependency to 4.12.0. Spark 3.2 upgrades it to 4.13.2 in order to
>>>>>> support K8s model 1.19.
>>>>>>
>>>>>> - Kafka Client 2.8: To bring in the client fixes, Spark 3.1 is using
>>>>>> Kafka Client 2.6. For Spark 3.2, SPARK-33913 upgraded to Kafka 2.7 with
>>>>>> Scala 2.12.13, but it was reverted later due to a Scala 2.12.13 issue.
>>>>>> Since KAFKA-12357 fixed the Scala requirement two days ago, Spark 3.2
>>>>>> will hopefully go with Kafka Client 2.8.
>>>>>>
>>>>>> # Some Features
>>>>>>
>>>>>> - Data Source v2: Spark 3.2 will deliver much richer DSv2 with Apache
>>>>>> Iceberg integration. Especially, we hope the on-going function catalog SPIP
>>>>>> and up-coming storage partitioned join SPIP can be delivered as a part of
>>>>>> Spark 3.2 and become an additional foundation.
>>>>>>
>>>>>> - Columnar Encryption: As of today, Apache Spark master branch
>>>>>> supports columnar encryption via Apache ORC 1.6 and it's documented via
>>>>>> SPARK-34036. Also, upcoming Apache Parquet 1.12 has a similar capability.
>>>>>> Hopefully, Apache Spark 3.2 is going to be the first release to have this
>>>>>> feature officially. Any feedback is welcome.
>>>>>>
>>>>>> - Improved ZStandard Support: Spark 3.2 will bring more benefits for
>>>>>> ZStandard users: 1) SPARK-34340 added native ZSTD JNI buffer pool support
>>>>>> for all IO operations, 2) SPARK-33978 makes ORC datasource support ZSTD
>>>>>> compression, 3) SPARK-34503 sets ZSTD as the default codec for event log
>>>>>> compression, 4) SPARK-34479 aims to support ZSTD at Avro data source. Also,
>>>>>> the upcoming Parquet 1.12 supports ZSTD (and supports JNI buffer pool),
>>>>>> too. I'm expecting more benefits.
>>>>>>
>>>>>> - Structured Streaming with RocksDB backend: According to the latest
>>>>>> update, it looks active enough for merging to master branch in Spark 3.2.
>>>>>>
>>>>>> Please share your thoughts and let's build better Apache Spark 3.2
>>>>>> together.
>>>>>>
>>>>>> Bests,
>>>>>> Dongjoon.
>>>>>>
>>>>>
