spark-dev mailing list archives

From Ryan Blue <rb...@netflix.com.INVALID>
Subject Re: time for Apache Spark 3.0?
Date Thu, 06 Sep 2018 16:10:05 GMT
My concern is that the v2 data source API is still evolving and not very
close to stable. I had hoped to have stabilized the API and behaviors for a
3.0 release. But we could also wait on that for a 4.0 release, depending on
when we think that will be.

Unless there is a pressing need to move to 3.0 for some other area, I think
it would be better for the v2 sources to have a 2.5 release.

On Thu, Sep 6, 2018 at 8:59 AM Xiao Li <gatorsmile@gmail.com> wrote:

> Yesterday, the 2.4 branch was created. Based on the above discussion, I
> think we can bump the master branch to 3.0.0-SNAPSHOT. Any concern?
>
> Thanks,
>
> Xiao
>
> vaquar khan <vaquar.khan@gmail.com> wrote on Sat, Jun 16, 2018 at 10:21 AM:
>
>> +1 for 2.4 next, followed by 3.0.
>>
>> Where can we get the Apache Spark road map for 2.4 and 2.5 ... 3.0?
>> Is it possible to share the proposed specifications for future releases, the
>> same way we do for past releases (
>> https://spark.apache.org/releases/spark-release-2-3-0.html)?
>> Regards,
>> Vaquar khan
>>
>> On Sat, Jun 16, 2018 at 12:02 PM, vaquar khan <vaquar.khan@gmail.com>
>> wrote:
>>
>>> Please ignore the YouTube link in my last email; I'm not sure how it got added.
>>> Apologies, I don't know how to delete it.
>>>
>>>
>>> On Sat, Jun 16, 2018 at 11:58 AM, vaquar khan <vaquar.khan@gmail.com>
>>> wrote:
>>>
>>>> +1
>>>>
>>>> https://www.youtube.com/watch?v=-ik7aJ5U6kg
>>>>
>>>> Regards,
>>>> Vaquar khan
>>>>
>>>> On Fri, Jun 15, 2018 at 4:55 PM, Reynold Xin <rxin@databricks.com>
>>>> wrote:
>>>>
>>>>> Yes. At this rate I think it's better to do 2.4 next, followed by 3.0.
>>>>>
>>>>>
>>>>> On Fri, Jun 15, 2018 at 10:52 AM Mridul Muralidharan <mridul@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> I agree, I don't see a pressing need for a major version bump either.
>>>>>>
>>>>>>
>>>>>> Regards,
>>>>>> Mridul
>>>>>> On Fri, Jun 15, 2018 at 10:25 AM Mark Hamstra <mark@clearstorydata.com> wrote:
>>>>>> >
>>>>>> > Changing major version numbers is not about new features or a vague notion that it is time to do something that will be seen to be a significant release. It is about breaking stable public APIs.
>>>>>> >
>>>>>> > I still remain unconvinced that the next version can't be 2.4.0.
>>>>>> >
>>>>>> > On Fri, Jun 15, 2018 at 1:34 AM Andy <andyyehoo@gmail.com> wrote:
>>>>>> >>
>>>>>> >> Dear all:
>>>>>> >>
>>>>>> >> It has been two months since this topic was proposed. Any progress so far? 2018 is already about half over.
>>>>>> >>
>>>>>> >> I agree that the new version should bring some exciting new features. How about this one:
>>>>>> >>
>>>>>> >> 6. Integrate an ML/DL framework as a core component and feature (such as Angel / BigDL / ...).
>>>>>> >>
>>>>>> >> 3.0 is a very important version for a good open source project. It would be better to shed the historical burden and focus on new areas. Spark is widely used all over the world as a successful big data framework, and it can be even better than that.
>>>>>> >>
>>>>>> >> Andy
>>>>>> >>
>>>>>> >>
>>>>>> >> On Thu, Apr 5, 2018 at 7:20 AM Reynold Xin <rxin@databricks.com> wrote:
>>>>>> >>>
>>>>>> >>> There was a discussion thread on scala-contributors about Apache Spark not yet supporting Scala 2.12, and that got me to think perhaps it is about time for Spark to work towards the 3.0 release. By the time it comes out, it will be more than 2 years since Spark 2.0.
>>>>>> >>>
>>>>>> >>> For contributors less familiar with Spark’s history, I want to give more context on Spark releases:
>>>>>> >>>
>>>>>> >>> 1. Timeline: Spark 1.0 was released May 2014. Spark 2.0 was July 2016. If we were to maintain the ~ 2 year cadence, it is time to work on Spark 3.0 in 2018.
>>>>>> >>>
>>>>>> >>> 2. Spark’s versioning policy promises that Spark does not break stable APIs in feature releases (e.g. 2.1, 2.2). API breaking changes are sometimes a necessary evil, and can be done in major releases (e.g. 1.6 to 2.0, 2.x to 3.0).
>>>>>> >>>
>>>>>> >>> 3. That said, a major version isn’t necessarily a playground for disruptive API changes that make it painful for users to update. The main purpose of a major release is an opportunity to fix things that are broken in the current API and remove certain deprecated APIs.
>>>>>> >>>
>>>>>> >>> 4. Spark as a project has a culture of evolving architecture and developing major new features incrementally, so major releases are not the only time for exciting new features. For example, the bulk of the work in the move towards the DataFrame API was done in Spark 1.3, and Continuous Processing was introduced in Spark 2.3. Both were feature releases rather than major releases.
>>>>>> >>>
>>>>>> >>>
>>>>>> >>> You can find more background in the thread discussing Spark 2.0:
>>>>>> >>> http://apache-spark-developers-list.1001551.n3.nabble.com/A-proposal-for-Spark-2-0-td15122.html
>>>>>> >>>
>>>>>> >>>
>>>>>> >>> The primary motivating factor IMO for a major version bump is to support Scala 2.12, which requires minor API breaking changes to Spark’s APIs. Similar to Spark 2.0, I think there are also opportunities for other changes that we know have been biting us for a long time but can’t be changed in feature releases (to be clear, I’m actually not sure they are all good ideas, but I’m writing them down as candidates for consideration):
>>>>>> >>>
>>>>>> >>> 1. Support Scala 2.12.
>>>>>> >>>
>>>>>> >>> 2. Remove interfaces, configs, and modules (e.g. Bagel) deprecated in Spark 2.x.
>>>>>> >>>
>>>>>> >>> 3. Shade all dependencies.
>>>>>> >>>
>>>>>> >>> 4. Change the reserved keywords in Spark SQL to be more ANSI-SQL compliant, to prevent users from shooting themselves in the foot, e.g. “SELECT 2 SECOND” -- is “SECOND” an interval unit or an alias? (See the sketch after this list.) To make it less painful for users to upgrade here, I’d suggest creating a flag for backward compatibility mode.
>>>>>> >>>
>>>>>> >>> 5. Similar to 4, make our type coercion rules in DataFrame/SQL more standard-compliant, and have a flag for backward compatibility.
>>>>>> >>>
>>>>>> >>> 6. Miscellaneous other small changes documented in JIRA already (e.g. “JavaPairRDD flatMapValues requires function returning Iterable, not Iterator”, “Prevent column name duplication in temporary view”).
>>>>>> >>>
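>>>>>> >>> A minimal sketch of what 4 and 5 look like today, to paste into spark-shell
>>>>>> >>> (which already provides the `spark` session). This is illustrative only: the
>>>>>> >>> exact parses and results depend on the Spark version and on whichever
>>>>>> >>> compatibility flag we end up adding, so treat nothing here as a spec.
>>>>>> >>>
>>>>>> >>>   // 4. SECOND is not a reserved keyword today, so this parses as a column
>>>>>> >>>   //    alias: one column named SECOND holding the literal 2, not an interval.
>>>>>> >>>   spark.sql("SELECT 2 SECOND").show()
>>>>>> >>>
>>>>>> >>>   //    The unambiguous interval literal, for comparison.
>>>>>> >>>   spark.sql("SELECT INTERVAL 2 SECONDS").show()
>>>>>> >>>
>>>>>> >>>   // 5. Two places where today's coercion is looser than the standard: the
>>>>>> >>>   //    string is silently coerced to an integer (returns true), and a failed
>>>>>> >>>   //    cast yields NULL rather than an error.
>>>>>> >>>   spark.sql("SELECT 1 = '1'").show()
>>>>>> >>>   spark.sql("SELECT CAST('abc' AS INT)").show()
>>>>>> >>>
>>>>>> >>> The backward-compatibility flag mentioned in 4 and 5 would presumably toggle
>>>>>> >>> between these behaviors and the stricter ANSI ones.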
>>>>>> >>>
>>>>>> >>> Now the reality of a major version bump is that the world often thinks in terms of what exciting features are coming. I do think there are a number of major changes happening already that can be part of the 3.0 release, if they make it in:
>>>>>> >>>
>>>>>> >>> 1. Scala 2.12 support (listing it twice)
>>>>>> >>> 2. Continuous Processing non-experimental
>>>>>> >>> 3. Kubernetes support non-experimental
>>>>>> >>> 4. A more fleshed out version of data source API v2 (I don’t think it is realistic to stabilize that in one release)
>>>>>> >>> 5. Hadoop 3.0 support
>>>>>> >>> 6. ...
>>>>>> >>>
>>>>>> >>>
>>>>>> >>>
>>>>>> >>> Similar to the 2.0 discussion, this thread should focus on the framework and whether it’d make sense to create Spark 3.0 as the next release, rather than the individual feature requests. Those are important but are best done in their own separate threads.
>>>>>> >>>
>>>>>> >>>
>>>>>> >>>
>>>>>> >>>
>>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Regards,
>>>> Vaquar Khan
>>>> +1 -224-436-0783
>>>> Greater Chicago
>>>>
>>>
>>>
>>>
>>> --
>>> Regards,
>>> Vaquar Khan
>>> +1 -224-436-0783
>>> Greater Chicago
>>>
>>
>>
>>
>> --
>> Regards,
>> Vaquar Khan
>> +1 -224-436-0783
>> Greater Chicago
>>
>

-- 
Ryan Blue
Software Engineer
Netflix
