spark-dev mailing list archives

From Jungtaek Lim <>
Subject Re: [Proposal] Modification to Spark's Semantic Versioning Policy
Date Fri, 13 Mar 2020 02:26:44 GMT
Xiao, thanks for the proposal and willingness to lead the effort!

I feel that it's still a bit different from what I've proposed. What I'm
proposing is closer to enforcing discussion whenever a change adds a new
public API or introduces a breaking change. It's good that we added the
section "Does this PR introduce any user-facing change?" to the PR template
(I'm not 100% sure it's being used as intended), but it doesn't enforce
anything; PRs containing breaking changes are reviewed and merged the same
way as other PRs, no difference. Technically such a PR can be merged within
a couple of hours after review by only one committer, which doesn't seem to
be enough to decide it's good to go, IMHO.

I believe a regular digest would be one step forward, as someone could
notice a change and jump in with a post-hoc review. One thing I'm a bit
afraid of with post-hoc review is that it's not easy to raise concerns about
already merged things, especially if we have to revert. It makes both sides
defensive: reviewers hesitate to do post-hoc review, and authors try to
defend the change already made. I'm a big +1 on taking this step, but given
we are revisiting the policy, it would be nice if we revisited the policy
for changes to the public API as well.

On Mon, Mar 9, 2020 at 2:39 PM Dongjoon Hyun <> wrote:

> Thank you all, especially for the audit efforts.
> Until now, the whole community has been working together in the same
> direction under the existing policy. That is always good.
> Since it seems that we are considering a new direction, I created
> an umbrella JIRA to track all activities.
>       Amend Spark's Semantic Versioning Policy
> As we know, a community-wide directional change always has a huge impact
> on daily PR reviews and regular releases. So, we had better treat all
> the reverting PRs as normal independent PRs instead of follow-ups.
> Specifically, I believe we need the following.
>     1. Have new JIRA IDs instead of treating these as a simple revert or
>        follow-up.
>         This is because we are not adding everything back blindly. For
>         example,
>             "Add back ImageSchema.readImages in Spark 3.0"
>         was created and closed as 'Won't Do' after considering the
>         trade-offs.
>         We need a JIRA-issue-level history for this kind of
>         request and decision.
>     2. Sometimes, as described by Michael, reverting is insufficient.
>         We need to provide more fine-grained deprecation for users'
>         safety, case by case.
>     3. Given the timeline, a newly added API should have test coverage in
>        the same PR from the beginning.
>         This is required because the whole reverting effort aims to give a
>         working API back.
> I believe that we are having a good discussion in this thread.
> We are making a big change in Apache Spark history.
> Please be part of the history by taking actions like replying, voting, and
> reviewing.
> Thanks,
> Dongjoon.
> On Sat, Mar 7, 2020 at 11:20 PM Takeshi Yamamuro <>
> wrote:
>> Yea, +1 on Jungtaek's suggestion; having the same strict policy for
>> adding new APIs looks nice.
>> > When we make API changes (e.g., adding new APIs or changing
>> > existing APIs), we should regularly publish them to the dev list. I am
>> > willing to lead this effort, work with my colleagues to summarize all the
>> > merged commits [especially the API changes], and then send the *bi-weekly
>> > digest* to the dev list
>> This digest looks very helpful for the community, thanks, Xiao!
>> Bests,
>> Takeshi
>> On Sun, Mar 8, 2020 at 12:05 PM Xiao Li <> wrote:
>>> I want to publicly thank *Ruifeng Zheng* for his work listing
>>> all the signature differences in Core, SQL and Hive that we made in this
>>> upcoming release. For details, please read the files attached to
>>> SPARK-30982. I went
>>> over these files and submitted the following PRs to add back the SparkSQL
>>> APIs whose maintenance costs are low, based on my own experience in
>>> SparkSQL development:
>>>    -
>>>       - functions.toDegrees/toRadians
>>>       - functions.approxCountDistinct
>>>       - functions.monotonicallyIncreasingId
>>>       - Column.!==
>>>       - Dataset.explode
>>>       - Dataset.registerTempTable
>>>       - SQLContext.getOrCreate, setActive, clearActive, constructors
>>>    -
>>>       - HiveContext
>>>       - createExternalTable APIs
>>>    -
>>>    -
>>>       - SQLContext.applySchema
>>>       - SQLContext.parquetFile
>>>       - SQLContext.jsonFile
>>>       - SQLContext.jsonRDD
>>>       - SQLContext.load
>>>       - SQLContext.jdbc
>>> If you think these APIs should not be added back, let me know and we can
>>> discuss the items further. In general, I think we should provide more
>>> evidence and discuss it publicly when we first drop such APIs.
>>> +1 on Jungtaek's comments. When we make API changes (e.g., adding
>>> new APIs or changing existing APIs), we should regularly publish
>>> them to the dev list. I am willing to lead this effort, work with my
>>> colleagues to summarize all the merged commits [especially the API
>>> changes], and then send the *bi-weekly digest* to the dev list. If you
>>> are willing to join this working group and help build these digests, feel
>>> free to send me a note.
>>> Cheers,
>>> Xiao
>>> On Sat, Mar 7, 2020 at 4:50 PM Jungtaek Lim <> wrote:
>>>> +1 for Sean as well.
>>>> Moreover, as I voiced on the previous thread, if we want to be
>>>> strict about retaining public APIs, what we really need alongside this
>>>> is a similar or stricter policy for adding public APIs. If we
>>>> don't apply the policy symmetrically, the problem will get worse: it's
>>>> still not that hard to add a public API (it only requires normal review),
>>>> but once the API is added and released it's going to be really hard to
>>>> remove it.
>>>> If we consider adding a public API and deprecating/removing a public API
>>>> as "critical" changes for the project, IMHO, it would give better
>>>> visibility and more open discussion if we make them go through the dev@
>>>> mailing list instead of directly filing a PR. As there are so many PRs
>>>> being submitted, it's nearly impossible to look into all of them - it may
>>>> require us to "watch" the repo and receive tons of mails. Compared to the
>>>> volume of GitHub PRs, the dev@ mailing list is not that crowded, so there
>>>> is less chance of missing critical changes or of having them quickly
>>>> decided by only a couple of committers.
>>>> These suggestions would slow down development - that would make us
>>>> realize we may want to "classify/mark" user-facing public APIs versus
>>>> others (just exposed as public) and only apply all the policies to the
>>>> former. For the latter we don't need to guarantee anything.
>>>> On Sun, Mar 8, 2020 at 4:31 AM Dongjoon Hyun <>
>>>> wrote:
>>>>> +1 for Sean's concerns and questions.
>>>>> Bests,
>>>>> Dongjoon.
>>>>> On Fri, Mar 6, 2020 at 3:14 PM Sean Owen <> wrote:
>>>>>> This thread established some good general principles, illustrated
>>>>>> a few good examples. It didn't draw specific conclusions about what
>>>>>> to add back, which is why it wasn't at all controversial. What it means
>>>>>> specific cases is where there may be disagreement, and that harder
>>>>>> hasn't been addressed.
>>>>>> The reverts I have seen so far seemed like the obvious one, but yes,
>>>>>> there are several more going on now, some pretty broad. I am not
>>>>>> even sure what all of them are. In addition to below,
>>>>>> Would it be too much
>>>>>> overhead to post to this thread any changes that one believes are
>>>>>> by these principles and perhaps a more strict interpretation of them
>>>>>> It's important enough we should get any data points or input, and
>>>>>> (We're obviously not going to debate each one.) A draft PR, or several,
>>>>>> actually sounds like a good vehicle for that -- as long as people
>>>>>> about them!
>>>>>> Also, is there any usage data available to share? many arguments
>>>>>> around 'commonly used' but can we know that more concretely?
>>>>>> Otherwise I think we'll back into implementing personal
>>>>>> interpretations of general principles, which is arguably the issue
>>>>>> in the first place, even when everyone believes in good faith in the same
>>>>>> principles.
>>>>>> On Fri, Mar 6, 2020 at 1:08 PM Dongjoon Hyun <>
>>>>>> wrote:
>>>>>>> Hi, All.
>>>>>>> Recently, reverting PRs seem to have started spreading like the
>>>>>>> *well-known* virus.
>>>>>>> Can we finalize this first before making unofficial personal
>>>>>>> decisions?
>>>>>>> Technically, this thread was not a vote and our website doesn't have
>>>>>>> a clear policy yet.
>>>>>>> [SPARK-25908][SQL][FOLLOW-UP] Add Back Multiple Removed APIs
>>>>>>>     ==> This technically reverts most of SPARK-25908.
>>>>>>> Revert "[SPARK-25457][SQL] IntegralDivide returns data type of
>>>>>>> operands"
>>>>>>> Revert [SPARK-24640][SQL] Return `NULL` from `size(NULL)` by
>>>>>>> Bests,
>>>>>>> Dongjoon.
>>>>>>> On Thu, Mar 5, 2020 at 9:08 PM Dongjoon Hyun <
>>>>>>>> wrote:
>>>>>>>> Hi, All.
>>>>>>>> There is an ongoing PR from Xiao referencing this email.
>>>>>>>> Bests,
>>>>>>>> Dongjoon.
>>>>>>>> On Fri, Feb 28, 2020 at 11:20 AM Sean Owen <>
>>>>>>>> wrote:
>>>>>>>>> On Fri, Feb 28, 2020 at 12:03 PM Holden Karau <
>>>>>>>>>> wrote:
>>>>>>>>> >>     1. Could you estimate how many revert commits are required
>>>>>>>>> >>        in `branch-3.0` for the new rubric?
>>>>>>>>> Fair question about what actual change this implies for 3.0? So far
>>>>>>>>> it seems like some targeted, quite reasonable reverts. I don't think
>>>>>>>>> anyone's suggesting reverting loads of changes.
>>>>>>>>> >>     2. Are you going to revert all removed test cases for the
>>>>>>>>> >>        deprecated ones?
>>>>>>>>> > This is a good point, making sure we keep the tests as well is
>>>>>>>>> > important (worse than removing a deprecated API is shipping it
>>>>>>>>> > broken).
>>>>>>>>> (I'd say, yes of course! which seems consistent with what is
>>>>>>>>> happening now)
>>>>>>>>> >>     3. Does it make any delay for Apache Spark 3.0.0 release?
>>>>>>>>> >>         (I believe it was previously scheduled in June before
>>>>>>>>> >>         Spark Summit 2020)
>>>>>>>>> >
>>>>>>>>> > I think if we need to delay to make a better release this is ok,
>>>>>>>>> > especially given our current preview releases being available to
>>>>>>>>> > gather community feedback.
>>>>>>>>> Of course these things block 3.0 -- all the more reason to keep it
>>>>>>>>> specific and targeted -- but nothing so far seems inconsistent with
>>>>>>>>> finishing in a month or two.
>>>>>>>>> >> Although there was a discussion already, I want to make the
>>>>>>>>> >> following tough parts sure.
>>>>>>>>> >>     4. We are not going to add Scala 2.11 API,
>>>>>>>>> > I hope not.
>>>>>>>>> >>
>>>>>>>>> >>     5. We are not going to support Python 2.x in Apache Spark
>>>>>>>>> >>        3.1+, right?
>>>>>>>>> > I think doing that would be bad, it's already end of lifed
>>>>>>>>> > elsewhere.
>>>>>>>>> Yeah this is an important subtext -- the valuable principles
>>>>>>>>> could be interpreted in many different ways depending on how much
>>>>>>>>> you weight the value of keeping APIs for compatibility vs the value
>>>>>>>>> of simplifying Spark and pushing users to newer APIs more forcibly.
>>>>>>>>> They're all judgment calls, based on necessarily limited data about
>>>>>>>>> the universe of users. We can only go on rare direct user feedback,
>>>>>>>>> on feedback perhaps from vendors as proxies for a subset of users,
>>>>>>>>> and the general good faith judgment of committers who have lived
>>>>>>>>> Spark for years.
>>>>>>>>> My specific interpretation is that the standard is (correctly)
>>>>>>>>> tightening going forward, and retroactively a bit for 3.0. But, I do
>>>>>>>>> not think anyone is advocating for the logical extreme of, for
>>>>>>>>> example, maintaining Scala 2.11 compatibility indefinitely. I think
>>>>>>>>> that falls out readily from the rubric here: maintaining
>>>>>>>>> compatibility is really quite painful if you ever support 2.13 too,
>>>>>>>>> for example.
>> --
>> ---
>> Takeshi Yamamuro
