I'm a bit concerned about what Arun is summarizing.

We are building on DSv2 and already have to rewrite for a bunch of changes in master/2.4, which increases the cost of dev work and release management.

If we are saying more changes are coming in 3.0, do we have more info on what value the current changes in 2.4 are adding now?


From: Wenchen Fan <cloud0fan@gmail.com>
Sent: Monday, September 10, 2018 12:35 AM
To: arunm@apache.org
Cc: Ryan Blue; sknapp@berkeley.edu; Dongjoon Hyun; joshrosen@databricks.com; Sean Owen; Spark dev list
Subject: Re: Branch 2.4 is cut
There are a lot of "breaking" changes we made in 2.4 for data source v2, though I agree SPARK-24882 is the most "breaking" one.

I don't agree SPARK-24882 is half-baked. But I'm willing to revert it if we have a bunch of data source v2 users and they are not willing to update their implementations heavily before the data source v2 API is stabilized.

On Mon, Sep 10, 2018 at 2:55 PM Arun Mahadevan <arunm@apache.org> wrote:
Ryan's proposal makes a lot of sense. It's better not to release half-baked changes in 2.4 that not only break a lot of the APIs released in 2.3 but are also expected to change further due to redesigns before 3.0, so I don't see much value in releasing them in 2.4.

On Sun, 9 Sep 2018 at 22:42, Wenchen Fan <cloud0fan@gmail.com> wrote:
Strictly speaking, data source v2 is always half-finished until we mark it as stable. We need some small milestones to move forward step by step.

The redesign also happens in an incremental way. SPARK-24882 mostly focuses on the "RDD" part of the API: the separation of reader factory and input partitions, the introduction of ScanConfig, etc. Then we focus on the high-level abstraction and want to change the "table" part of the API.

In my understanding, each PR should be self-contained. If we are OK to have SPARK-24882 in master as an individual commit, I think it's also OK to have it in branch 2.4.

I've created https://issues.apache.org/jira/browse/SPARK-25390 to track the new abstraction. It doesn't change the API a lot, but it updates the streaming execution engine quite a bit.


On Mon, Sep 10, 2018 at 4:20 AM Ryan Blue <rblue@netflix.com> wrote:
Wenchen, can you hold off on the first RC?

The half-finished changes from the redesign of the DataSourceV2 API are in master, added in SPARK-24882, and are now in the 2.4 branch. We've had a lot of good discussion since that PR was merged to update and fix the design, plus only one of the follow-ups on SPARK-25186 is done. Clearly, the redesign was too large to get into 2.4 in so little time -- it was proposed about 10 days before the original branch date -- and I don't think it is a good idea to release half-finished major changes.

The easiest solution is to revert SPARK-24882 in the release branch. That way we have minor changes in 2.4 and major changes in the next release, instead of major changes in both. What does everyone think?


On Fri, Sep 7, 2018 at 10:37 AM shane knapp <sknapp@berkeley.edu> wrote:
++joshrosen  (thanks for the help w/deploying the jenkins configs)

the basic 2.4 builds are deployed and building!

i haven't created (a) build(s) yet for scala 2.12...  i'll be coordinating this w/the databricks folks next week.

On Fri, Sep 7, 2018 at 9:53 AM, Dongjoon Hyun <dongjoon.hyun@gmail.com> wrote:
Thank you, Shane! :D


On Fri, Sep 7, 2018 at 9:51 AM shane knapp <sknapp@berkeley.edu> wrote:
i'll try and get to the 2.4 branch stuff today...  

Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead

Ryan Blue
Software Engineer