spark-dev mailing list archives

From Wenchen Fan <cloud0...@gmail.com>
Subject Re: Branch 2.4 is cut
Date Mon, 10 Sep 2018 07:34:44 GMT
We made a lot of "breaking" changes in 2.4 for data source v2, though I
agree SPARK-24882 is the most "breaking" one.

I don't agree that SPARK-24882 is half-baked. But I'm willing to revert it if
we have a bunch of data source v2 users who are not willing to update their
implementations heavily before the data source v2 API is stabilized.

On Mon, Sep 10, 2018 at 2:55 PM Arun Mahadevan <arunm@apache.org> wrote:

> Ryan's proposal makes a lot of sense. It's better not to release half-baked
> changes in 2.4, which not only break a lot of the APIs released in 2.3 but
> are also expected to change further due to redesigns before 3.0, so I don't
> see much value in releasing them in 2.4.
>
> On Sun, 9 Sep 2018 at 22:42, Wenchen Fan <cloud0fan@gmail.com> wrote:
>
>> Strictly speaking, data source v2 is always half-finished until we mark
>> it as stable. We need some small milestones to move forward step by step.
>>
>> The redesign also happens in an incremental way. SPARK-24882 mostly focuses
>> on the "RDD" part of the API: the separation of reader factory and input
>> partitions, the introduction of ScanConfig, etc. Then we focus on the
>> high-level abstraction and want to change the "table" part of the API.
>>
>> In my understanding, each PR should be self-contained. If we are OK with
>> having SPARK-24882 in master as an individual commit, I think it's also OK
>> to have it in branch 2.4.
>>
>> I've created https://issues.apache.org/jira/browse/SPARK-25390 to track
>> the new abstraction. It doesn't change the API a lot, but it updates the
>> streaming execution engine quite a bit.
>>
>> Thanks,
>> Wenchen
>>
>> On Mon, Sep 10, 2018 at 4:20 AM Ryan Blue <rblue@netflix.com> wrote:
>>
>>> Wenchen, can you hold off on the first RC?
>>>
>>> The half-finished changes from the redesign of the DataSourceV2 API are
>>> in master, added in SPARK-24882
>>> <https://github.com/apache/spark/pull/22009>, and are now in the 2.4
>>> branch. We've had a lot of good discussion since that PR was merged to
>>> update and fix the design, plus only one of the follow-ups on
>>> SPARK-25186 <https://issues.apache.org/jira/browse/SPARK-25186> is
>>> done. Clearly, the redesign was too large to get into 2.4 in so little time
>>> -- it was proposed about 10 days before the original branch date -- and I
>>> don't think it is a good idea to release half-finished major changes.
>>>
>>> The easiest solution is to revert SPARK-24882 in the release branch.
>>> That way we have minor changes in 2.4 and major changes in the next
>>> release, instead of major changes in both. What does everyone think?
>>>
>>> rb
>>>
>>> On Fri, Sep 7, 2018 at 10:37 AM shane knapp <sknapp@berkeley.edu> wrote:
>>>
>>>> ++joshrosen  (thanks for the help w/deploying the jenkins configs)
>>>>
>>>> the basic 2.4 builds are deployed and building!
>>>>
>>>> i haven't created (a) build(s) yet for scala 2.12...  i'll be
>>>> coordinating this w/the databricks folks next week.
>>>>
>>>> On Fri, Sep 7, 2018 at 9:53 AM, Dongjoon Hyun <dongjoon.hyun@gmail.com>
>>>> wrote:
>>>>
>>>>> Thank you, Shane! :D
>>>>>
>>>>> Bests,
>>>>> Dongjoon.
>>>>>
>>>>> On Fri, Sep 7, 2018 at 9:51 AM shane knapp <sknapp@berkeley.edu>
>>>>> wrote:
>>>>>
>>>>>> i'll try and get to the 2.4 branch stuff today...
>>>>>>
>>>>>>
>>>>
>>>>
>>>> --
>>>> Shane Knapp
>>>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>>>> https://rise.cs.berkeley.edu
>>>>
>>>
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>>>
>>
