spark-dev mailing list archives

From Dongjoon Hyun <dongjoon.h...@gmail.com>
Subject Re: Apache Spark 3.2 Expectation
Date Sat, 27 Feb 2021 00:05:56 GMT
Sure, thank you, Hyukjin.

Bests,
Dongjoon.


On Fri, Feb 26, 2021 at 4:01 PM Hyukjin Kwon <gurwls223@gmail.com> wrote:

> I have an idea that I'll send an email to discuss next week or the week
> after. I did not have enough bandwidth to drive both at the same time. I
> would appreciate it if we had some more time for 3.2.
>
> In addition, it would also be great if we followed the schedule and caught
> potential blockers quickly during QA instead of when we cut RCs. That would
> considerably speed up the process and keep it on time.
>
> Thanks.
>
>
> On Sat, 27 Feb 2021, 06:00 Dongjoon Hyun, <dongjoon.hyun@gmail.com> wrote:
>
>> Thank you for sharing your plan, Huaxin!
>>
>> Bests,
>> Dongjoon.
>>
>>
>> On Fri, Feb 26, 2021 at 12:20 PM huaxin gao <huaxin.gao11@gmail.com>
>> wrote:
>>
>>> Thanks Dongjoon and Xiao for the discussion. I would like to add Data
>>> Source V2 Aggregate push down to the list. I am currently working on
>>> JDBC Data Source V2 Aggregate push down, but the common code can be used
>>> for the file based V2 Data Source as well. For example, MAX and MIN can be
>>> pushed down to Parquet and ORC, since they can use statistics information
>>> to perform these operations efficiently. Quite a few users are
>>> interested in this Aggregate push down feature and the preliminary
>>> performance test for JDBC Aggregate push down is positive. So I think it is
>>> a valuable feature to add for Spark 3.2.
>>>
>>> Thanks,
>>> Huaxin
>>>
>>> On Fri, Feb 26, 2021 at 11:13 AM Xiao Li <gatorsmile@gmail.com> wrote:
>>>
>>>> Thank you, Dongjoon, for initiating this discussion. Let us keep it
>>>> open. It might take 1-2 weeks to collect from the community all the
>>>> features we plan to build and ship in 3.2 since we just finished the 3.1
>>>> voting.
>>>>
>>>>
>>>>> 3. +100 for Apache Spark 3.2.0 in July 2021. Maybe, we need
>>>>> `branch-cut` in April because we took 3 months for Spark 3.1 release.
>>>>
>>>>
>>>> TBH, cutting the branch this April does not look good to me. That
>>>> means, we only have one month left for feature development of Spark 3.2.
>>>> Do we have enough features in the current master branch? If not, are we
>>>> able to finish major features we collected here? Do they have a timeline
>>>> or project plan?
>>>>
>>>> Xiao
>>>>
>>>> On Fri, Feb 26, 2021 at 10:07 AM Dongjoon Hyun <dongjoon.hyun@gmail.com> wrote:
>>>>
>>>>> Thank you, Mridul and Sean.
>>>>>
>>>>> 1. Yes, `2017` was a typo. Java 17 is scheduled for September 2021.
>>>>> And, of course, it's a nice-to-have status. :)
>>>>>
>>>>> 2. `Push based shuffle and disaggregated shuffle`. Definitely. Thanks
>>>>> for sharing.
>>>>>
>>>>> 3. +100 for Apache Spark 3.2.0 in July 2021. Maybe, we need
>>>>> `branch-cut` in April because we took 3 months for Spark 3.1 release.
>>>>>     Let's update our release roadmap on the Apache Spark website.
>>>>>
>>>>> > I'd roughly expect 3.2 in, say, July of this year, given the usual
>>>>> > cadence. No reason it couldn't be a little sooner or later. There is
>>>>> > already some good stuff in 3.2 and will be a good minor release in 5-6
>>>>> > months.
>>>>>
>>>>> Bests,
>>>>> Dongjoon.
>>>>>
>>>>>
>>>>>
>>>>> On Thu, Feb 25, 2021 at 9:33 AM Sean Owen <srowen@gmail.com> wrote:
>>>>>
>>>>>> I'd roughly expect 3.2 in, say, July of this year, given the usual
>>>>>> cadence. No reason it couldn't be a little sooner or later. There is
>>>>>> already some good stuff in 3.2 and will be a good minor release in 5-6
>>>>>> months.
>>>>>>
>>>>>> On Thu, Feb 25, 2021 at 10:57 AM Dongjoon Hyun <
>>>>>> dongjoon.hyun@gmail.com> wrote:
>>>>>>
>>>>>>> Hi, All.
>>>>>>>
>>>>>>> Since we have been preparing Apache Spark 3.2.0 in master branch
>>>>>>> since December 2020, March seems to be a good time to share our
>>>>>>> thoughts and aspirations on Apache Spark 3.2.
>>>>>>>
>>>>>>> According to the progress on Apache Spark 3.1 release, Apache Spark
>>>>>>> 3.2 seems to be the last minor release of this year. Given the
>>>>>>> timeframe, we might consider the following. (This is a small set.
>>>>>>> Please add your thoughts to this limited list.)
>>>>>>>
>>>>>>> # Languages
>>>>>>>
>>>>>>> - Scala 2.13 Support: This was expected on 3.1 via SPARK-25075 but
>>>>>>> slipped out. Currently, we are trying to use Scala 2.13.5 via
>>>>>>> SPARK-34505 and investigating the publishing issue. Thank you for
>>>>>>> your contributions and feedback on this.
>>>>>>>
>>>>>>> - Java 17 LTS Support: Java 17 LTS will arrive in September 2017.
>>>>>>> Like Java 11, we need lots of support from our dependencies. Let's
>>>>>>> see.
>>>>>>>
>>>>>>> - Python 3.6 Deprecation(?): Python 3.6 community support ends at
>>>>>>> 2021-12-23. So, the deprecation is not required yet, but we had
>>>>>>> better prepare it because we don't have an ETA of Apache Spark 3.3
>>>>>>> in 2022.
>>>>>>>
>>>>>>> - SparkR CRAN publishing: As we know, it's discontinued so far.
>>>>>>> Resuming it depends on the success of Apache SparkR 3.1.1 CRAN
>>>>>>> publishing. If it succeeds to revive it, we can keep publishing.
>>>>>>> Otherwise, I believe we had better drop it from the releasing work
>>>>>>> item list officially.
>>>>>>>
>>>>>>> # Dependencies
>>>>>>>
>>>>>>> - Apache Hadoop 3.3.2: Hadoop 3.2.0 becomes the default Hadoop
>>>>>>> profile in Apache Spark 3.1. Currently, Spark master branch lives on
>>>>>>> Hadoop 3.2.2's shaded clients via SPARK-33212. So far, there is one
>>>>>>> on-going report at YARN environment. We hope it will be fixed soon
>>>>>>> at Spark 3.2 timeframe and we can move toward Hadoop 3.3.2.
>>>>>>>
>>>>>>> - Apache Hive 2.3.9: Spark 3.0 starts to use Hive 2.3.7 by default
>>>>>>> instead of old Hive 1.2 fork. Spark 3.1 removed hive-1.2 profile
>>>>>>> completely via SPARK-32981 and replaced the generated
>>>>>>> hive-service-rpc code with the official dependency via SPARK-32981.
>>>>>>> We are steadily improving this area and will consume Hive 2.3.9 if
>>>>>>> available.
>>>>>>>
>>>>>>> - K8s Client 4.13.2: During K8s GA activity, Spark 3.1 upgrades K8s
>>>>>>> client dependency to 4.12.0. Spark 3.2 upgrades it to 4.13.2 in
>>>>>>> order to support K8s model 1.19.
>>>>>>>
>>>>>>> - Kafka Client 2.8: To bring the client fixes, Spark 3.1 is using
>>>>>>> Kafka Client 2.6. For Spark 3.2, SPARK-33913 upgraded to Kafka 2.7
>>>>>>> with Scala 2.12.13, but it was reverted later due to Scala 2.12.13
>>>>>>> issue. Since KAFKA-12357 fixed the Scala requirement two days ago,
>>>>>>> Spark 3.2 will go with Kafka Client 2.8 hopefully.
>>>>>>>
>>>>>>> # Some Features
>>>>>>>
>>>>>>> - Data Source v2: Spark 3.2 will deliver much richer DSv2 with
>>>>>>> Apache Iceberg integration. Especially, we hope the on-going
>>>>>>> function catalog SPIP and up-coming storage partitioned join SPIP
>>>>>>> can be delivered as a part of Spark 3.2 and become an additional
>>>>>>> foundation.
>>>>>>>
>>>>>>> - Columnar Encryption: As of today, Apache Spark master branch
>>>>>>> supports columnar encryption via Apache ORC 1.6 and it's documented
>>>>>>> via SPARK-34036. Also, upcoming Apache Parquet 1.12 has a similar
>>>>>>> capability. Hopefully, Apache Spark 3.2 is going to be the first
>>>>>>> release to have this feature officially. Any feedback is welcome.
>>>>>>>
>>>>>>> - Improved ZStandard Support: Spark 3.2 will bring more benefits for
>>>>>>> ZStandard users: 1) SPARK-34340 added native ZSTD JNI buffer pool
>>>>>>> support for all IO operations, 2) SPARK-33978 makes ORC datasource
>>>>>>> support ZSTD compression, 3) SPARK-34503 sets ZSTD as the default
>>>>>>> codec for event log compression, 4) SPARK-34479 aims to support ZSTD
>>>>>>> at Avro data source. Also, the upcoming Parquet 1.12 supports ZSTD
>>>>>>> (and supports JNI buffer pool), too. I'm expecting more benefits.
>>>>>>>
>>>>>>> - Structured Streaming with RocksDB backend: According to the latest
>>>>>>> update, it looks active enough for merging to master branch in
>>>>>>> Spark 3.2.
>>>>>>>
>>>>>>> Please share your thoughts and let's build better Apache Spark 3.2
>>>>>>> together.
>>>>>>>
>>>>>>> Bests,
>>>>>>> Dongjoon.
>>>>>>>
>>>>>>
