spark-dev mailing list archives

From Ryan Blue <rb...@netflix.com.INVALID>
Subject Re: [DISCUSS] Spark 2.5 release
Date Tue, 24 Sep 2019 22:54:13 GMT
> That's not a new requirement, that's an "implicit" requirement via
semantic versioning.

The expectation is that the DSv2 API will change in minor versions in the
2.x line. The API is marked with the Experimental API annotation to signal
that it can change, and it has been changing.

A requirement to not change this API for a 2.5 release is a new
requirement. I'm fine with that if that's what everyone wants. Like I said,
if we want to add a requirement to not change this API then we shouldn't
release the 2.5 that I'm proposing.

On Tue, Sep 24, 2019 at 2:51 PM Jungtaek Lim <kabhwan@gmail.com> wrote:

> >> Apache Spark 2.4.x and 2.5.x DSv2 should be compatible.
>
> > This has not been a requirement for DSv2 development so far. If this is
> a new requirement, then we should not do a 2.5 release.
>
> My 2 cents: the target version of the new DSv2 has only ever been 3.0, so we
> never had a chance to think about such a requirement - that's why there's no
> restriction on breaking compatibility in the codebase. That's not a new
> requirement, that's an "implicit" requirement via semantic versioning. I
> agree that some APIs have changed between Spark 2.x versions, but I
> guess the changes in the "new" DSv2 would be bigger than the sum of the changes
> to the "old" DSv2, which were introduced across multiple minor versions.
>
> Suppose we're developers in the Spark ecosystem maintaining a custom data source
> (forget about developing Spark): I would get some official announcement of the
> next minor version, and I'd want to try it out quickly to see whether my stuff
> still supports the new version. When I change the dependency version, everything
> will break. My hopeful expectation would be no issues while upgrading, but it
> turns out that's not the case, and it even requires new learning (not only fixing
> compilation failures). It would just make me give up supporting Spark 2.5, or
> at least I wouldn't follow up on such a change quickly. IMHO 3.0-techpreview has
> an advantage here (assuming we provide Maven artifacts as well as an official
> announcement), as it sets the expectation that there are a bunch of changes,
> given it's a new major version. It also provides a bunch of time to try
> adopting it before the version is officially released.
>
>
> On Wed, Sep 25, 2019 at 4:56 AM Ryan Blue <rblue@netflix.com> wrote:
>
>> From those questions, I can see that there is significant confusion about
>> what I'm proposing, so let me try to clear it up.
>>
>> > 1. Is DSv2 stable in `master`?
>>
>> DSv2 has reached a stable API that is capable of supporting all of the
>> features we intend to deliver for Spark 3.0. The proposal is to backport
>> the same API and features for Spark 2.5.
>>
>> I am not saying that this API won't change after 3.0. Notably, Reynold
>> wants to change the use of InternalRow. But, these changes are after 3.0
>> and don't affect the compatibility I'm proposing, between the 2.5 and 3.0
>> releases. I also doubt that breaking changes would happen by 3.1.
>>
>> > 2. If then, what subset of DSv2 patches does Ryan is suggesting
>> backporting?
>>
>> I am proposing backporting what we intend to deliver for 3.0: the API
>> currently in master, SQL support, and multi-catalog support.
>>
>> > 3. How much those backporting DSv2 patches looks differently in
>> `branch-2.4`?
>>
>> DSv2 is mostly an addition located in the `connector` package. It also
>> changes some parts of the SQL parser and adds parsed plans, as well as new
>> rules to convert from parsed plans. This is not an invasive change because
>> we kept most of DSv2 separate. DSv2 should be nearly identical between the
>> two branches.
>>
>> > 4. What does he mean by `without breaking changes? Is it technically
>> feasible?
>>
>> DSv2 is marked unstable in the 2.x line and changes between releases. The
>> API changed between 2.3 and 2.4, so this would be no different. But, we
>> would keep the API the same between 2.5 and 3.0 to assist migration.
>>
>> This is technically feasible because what we are planning to deliver for
>> 3.0 is nearly ready, and the API has not needed to change recently.
>>
>> > Apache Spark 2.4.x and 2.5.x DSv2 should be compatible.
>>
>> This has not been a requirement for DSv2 development so far. If this is a
>> new requirement, then we should not do a 2.5 release.
>>
>> > 5. How long does it take? Is it possible before 3.0.0-preview? Who will
>> work on that backporting?
>>
>> As I said, I'm already going to do this work, so I'm offering to release
>> it to the community. I don't know how long it will take, but this work and
>> 3.0-preview are not mutually exclusive.
>>
>> > 6. Is this meaningful if 2.5 and 3.1 become different again too soon
>> (in 2020 Summer)?
>>
>> It is useful to me, so I assume it is useful to others.
>>
>> I also think it is unlikely that 3.1 will need to make API changes to
>> DSv2. There may be some bugs found, but I don't think we will break API
>> compatibility so quickly. Most of the changes to the API will require only
>> additions.
>>
>> > If you have a working branch, please share with us.
>>
>> I don't have a branch to share.
>>
>>
>> On Mon, Sep 23, 2019 at 6:47 PM Dongjoon Hyun <dongjoon.hyun@gmail.com>
>> wrote:
>>
>>> Hi, Ryan.
>>>
>>> This thread has many replies, as you can see. That is evidence that the
>>> community is very interested in your suggestion.
>>>
>>> > I'm offering to help build a stable release without breaking changes.
>>> But if there is no community interest in it, I'm happy to drop this.
>>>
>>> In this thread, the root cause of the disagreement is the lack of
>>> supporting evidence for your claims.
>>>
>>> 1. Is DSv2 stable in `master`?
>>> 2. If then, what subset of DSv2 patches does Ryan is suggesting
>>> backporting?
>>> 3. How much those backporting DSv2 patches looks differently in
>>> `branch-2.4`?
>>> 4. What does he mean by `without breaking changes? Is it technically
>>> feasible?
>>>     Apache Spark 2.4.x and 2.5.x DSv2 should be compatible. (Not between
>>> 2.5.x DSv2 and 3.0.0 DSv2)
>>> 5. How long does it take? Is it possible before 3.0.0-preview? Who will
>>> work on that backporting?
>>> 6. Is this meaningful if 2.5 and 3.1 become different again too soon (in
>>> 2020 Summer)?
>>>
>>> We are SW engineers.
>>> If you have a working branch, please share with us.
>>> It will help us understand your suggestion and this discussion.
>>> We can help you verify that branch achieves your goal.
>>> The branch is tested already, isn't it?
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>>
>>>
>>>
>>> On Mon, Sep 23, 2019 at 10:44 AM Holden Karau <holden@pigscanfly.ca>
>>> wrote:
>>>
>>>> I would personally love to see us provide a gentle migration path to
>>>> Spark 3 especially if much of the work is already going to happen anyways.
>>>>
>>>> Maybe giving it a different name (eg something like
>>>> Spark-2-to-3-transitional) would make it more clear about its intended
>>>> purpose and encourage folks to move to 3 when they can?
>>>>
>>>> On Mon, Sep 23, 2019 at 9:17 AM Ryan Blue <rblue@netflix.com.invalid>
>>>> wrote:
>>>>
>>>>> My understanding is that 3.0-preview is not going to be a
>>>>> production-ready release. For those of us that have been using backports
>>>>> of DSv2 in production, that doesn't help.
>>>>>
>>>>> It also doesn't help as a stepping stone because users would need to
>>>>> handle all of the incompatible changes in 3.0. Using 3.0-preview would
>>>>> be an unstable release with breaking changes instead of a stable release
>>>>> without the breaking changes.
>>>>>
>>>>> I'm offering to help build a stable release without breaking changes.
>>>>> But if there is no community interest in it, I'm happy to drop this.
>>>>>
>>>>> On Sun, Sep 22, 2019 at 6:39 PM Hyukjin Kwon <gurwls223@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> +1 for Matei's as well.
>>>>>>
>>>>>> On Sun, 22 Sep 2019, 14:59 Marco Gaido, <marcogaido91@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> I agree with Matei too.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Marco
>>>>>>>
>>>>>>> On Sun, 22 Sep 2019 at 03:44, Dongjoon Hyun <
>>>>>>> dongjoon.hyun@gmail.com> wrote:
>>>>>>>
>>>>>>>> +1 for Matei's suggestion!
>>>>>>>>
>>>>>>>> Bests,
>>>>>>>> Dongjoon.
>>>>>>>>
>>>>>>>> On Sat, Sep 21, 2019 at 5:44 PM Matei Zaharia <
>>>>>>>> matei.zaharia@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> If the goal is to get people to try the DSv2 API and build DSv2
>>>>>>>>> data sources, can we recommend the 3.0-preview release for this? That would
>>>>>>>>> get people shifting to 3.0 faster, which is probably better overall
>>>>>>>>> compared to maintaining two major versions. There’s not that much else
>>>>>>>>> changing in 3.0 if you already want to update your Java version.
>>>>>>>>>
>>>>>>>>> On Sep 21, 2019, at 2:45 PM, Ryan Blue <rblue@netflix.com.INVALID>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> > If you insist we shouldn't change the unstable temporary API in
>>>>>>>>> 3.x . . .
>>>>>>>>>
>>>>>>>>> Not what I'm saying at all. I said we should carefully
>>>>>>>>> consider whether a breaking change is the right decision in the 3.x line.
>>>>>>>>>
>>>>>>>>> All I'm suggesting is that we can make a 2.5 release with the
>>>>>>>>> feature and an API that is the same as the one in 3.0.
>>>>>>>>>
>>>>>>>>> > I also don't get this backporting a giant feature to 2.x line
>>>>>>>>>
>>>>>>>>> I am planning to do this so we can use DSv2 before 3.0 is
>>>>>>>>> released. Then we can have a source implementation that works in both 2.x
>>>>>>>>> and 3.0 to make the transition easier. Since I'm already doing the work,
>>>>>>>>> I'm offering to share it with the community.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Sat, Sep 21, 2019 at 2:36 PM Reynold Xin <rxin@databricks.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Because for example we'd need to move the location of
>>>>>>>>>> InternalRow, breaking the package name. If you insist we shouldn't change
>>>>>>>>>> the unstable temporary API in 3.x to maintain compatibility with 3.0, which
>>>>>>>>>> is totally different from my understanding of the situation when you
>>>>>>>>>> exposed it, then I'd say we should gate 3.0 on having a stable row
>>>>>>>>>> interface.
>>>>>>>>>>
>>>>>>>>>> I also don't get this backporting a giant feature to 2.x line ...
>>>>>>>>>> as suggested by others in the thread, DSv2 would be one of the main reasons
>>>>>>>>>> people upgrade to 3.0. What's so special about DSv2 that we are doing this?
>>>>>>>>>> Why not abandoning 3.0 entirely and backport all the features to 2.x?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Sat, Sep 21, 2019 at 2:31 PM, Ryan Blue <rblue@netflix.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Why would that require an incompatible change?
>>>>>>>>>>>
>>>>>>>>>>> We *could* make an incompatible change and remove support for
>>>>>>>>>>> InternalRow, but I think we would want to carefully consider whether that
>>>>>>>>>>> is the right decision. And in any case, we would be able to keep 2.5 and
>>>>>>>>>>> 3.0 compatible, which is the main goal.
>>>>>>>>>>>
>>>>>>>>>>> On Sat, Sep 21, 2019 at 2:28 PM Reynold Xin <rxin@databricks.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> How would you not make incompatible changes in 3.x? As
>>>>>>>>>>>> discussed the InternalRow API is not stable and needs to change.
>>>>>>>>>>>>
>>>>>>>>>>>> On Sat, Sep 21, 2019 at 2:27 PM Ryan Blue <rblue@netflix.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> > Making downstream to diverge their implementation heavily
>>>>>>>>>>>>> between minor versions (say, 2.4 vs 2.5) wouldn't be a good experience
>>>>>>>>>>>>>
>>>>>>>>>>>>> You're right that the API has been evolving in the 2.x
>>>>>>>>>>>>> line. But, it is now reasonably stable with respect to the current feature
>>>>>>>>>>>>> set and we should not need to break compatibility in the 3.x line. Because
>>>>>>>>>>>>> we have reached our goals for the 3.0 release, we can backport at least
>>>>>>>>>>>>> those features to 2.x and confidently have an API that works in both a 2.x
>>>>>>>>>>>>> release and is compatible with 3.0, if not 3.1 and later releases as well.
>>>>>>>>>>>>>
>>>>>>>>>>>>> > I'd rather say preparation of Spark 2.5 should be started
>>>>>>>>>>>>> after Spark 3.0 is officially released
>>>>>>>>>>>>>
>>>>>>>>>>>>> The reason I'm suggesting this is that I'm already going to do
>>>>>>>>>>>>> the work to backport the 3.0 release features to 2.4. I've been asked by
>>>>>>>>>>>>> several people when DSv2 will be released, so I know there is a lot of
>>>>>>>>>>>>> interest in making this available sooner than 3.0. If I'm already doing the
>>>>>>>>>>>>> work, then I'd be happy to share that with the community.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I don't see why 2.5 and 3.0 are mutually exclusive. We can
>>>>>>>>>>>>> work on 2.5 while preparing the 3.0 preview and fixing bugs. For DSv2, the
>>>>>>>>>>>>> work is about complete so we can easily release the same set of features
>>>>>>>>>>>>> and API in 2.5 and 3.0.
>>>>>>>>>>>>>
>>>>>>>>>>>>> If we decide for some reason to wait until after 3.0 is
>>>>>>>>>>>>> released, I don't know that there is much value in a 2.5. The purpose is to
>>>>>>>>>>>>> be a step toward 3.0, and releasing that step after 3.0 doesn't seem
>>>>>>>>>>>>> helpful to me. It also wouldn't get these features out any sooner than 3.0,
>>>>>>>>>>>>> as a 2.5 release probably would, given the work needed to validate the
>>>>>>>>>>>>> incompatible changes in 3.0.
>>>>>>>>>>>>>
>>>>>>>>>>>>> > DSv2 change would be the major backward incompatibility
>>>>>>>>>>>>> which Spark 2.x users may hesitate to upgrade
>>>>>>>>>>>>>
>>>>>>>>>>>>> As I pointed out, DSv2 has been changing in the 2.x line, so
>>>>>>>>>>>>> this is expected. I don't think it will need incompatible changes in the
>>>>>>>>>>>>> 3.x line.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, Sep 20, 2019 at 9:25 PM Jungtaek Lim <
>>>>>>>>>>>>> kabhwan@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Just 2 cents, I haven't tracked the change of DSv2 (though I
>>>>>>>>>>>>>> needed to deal with this as the change made confusion on my PRs...), but my
>>>>>>>>>>>>>> bet is that DSv2 would be already changed in incompatible way, at least who
>>>>>>>>>>>>>> works for custom DataSource. Making downstream to diverge their
>>>>>>>>>>>>>> implementation heavily between minor versions (say, 2.4 vs 2.5) wouldn't be
>>>>>>>>>>>>>> a good experience - especially we are not completely closed the chance
>>>>>>>>>>>>>> to further modify DSv2, and the change could be backward incompatible.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> If we really want to bring the DSv2 change to 2.x version
>>>>>>>>>>>>>> line to let end users avoid forcing to upgrade Spark 3.x to enjoy new DSv2,
>>>>>>>>>>>>>> I'd rather say preparation of Spark 2.5 should be started after Spark 3.0
>>>>>>>>>>>>>> is officially released, honestly even later than that, say, getting some
>>>>>>>>>>>>>> reports from Spark 3.0 about DSv2 so that we feel DSv2 is OK. I hope we
>>>>>>>>>>>>>> don't make Spark 2.5 be a kind of "tech-preview" which Spark 2.4 users may
>>>>>>>>>>>>>> be frustrated to upgrade to next minor version.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Btw, do we have any specific target users for this?
>>>>>>>>>>>>>> Personally DSv2 change would be the major backward incompatibility which
>>>>>>>>>>>>>> Spark 2.x users may hesitate to upgrade, so they might be already prepared
>>>>>>>>>>>>>> to migrate to Spark 3.0 if they are prepared to migrate to new DSv2.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Sat, Sep 21, 2019 at 12:46 PM Dongjoon Hyun <
>>>>>>>>>>>>>> dongjoon.hyun@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Do you mean you want to have a breaking API change between
>>>>>>>>>>>>>>> 3.0 and 3.1?
>>>>>>>>>>>>>>> I believe we follow Semantic Versioning (
>>>>>>>>>>>>>>> https://spark.apache.org/versioning-policy.html ).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> > We just won’t add any breaking changes before 3.1.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Bests,
>>>>>>>>>>>>>>> Dongjoon.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Fri, Sep 20, 2019 at 11:48 AM Ryan Blue <
>>>>>>>>>>>>>>> rblue@netflix.com.invalid> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> > I don’t think we need to gate a 3.0 release on making a
>>>>>>>>>>>>>>>> more stable version of InternalRow
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Sounds like we agree, then. We will use it for 3.0, but
>>>>>>>>>>>>>>>> there are known problems with it.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> > Thinking we’d have dsv2 working in both 3.x (which will
>>>>>>>>>>>>>>>> change and progress towards more stable, but will have to break certain
>>>>>>>>>>>>>>>> APIs) and 2.x seems like a false premise.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Why do you think we will need to break certain APIs before
>>>>>>>>>>>>>>>> 3.0?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I’m only suggesting that we release the same support in a
>>>>>>>>>>>>>>>> 2.5 release that we do in 3.0. Since we are nearly finished with the 3.0
>>>>>>>>>>>>>>>> goals, it seems like we can certainly do that. We just won’t add any
>>>>>>>>>>>>>>>> breaking changes before 3.1.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Fri, Sep 20, 2019 at 11:39 AM Reynold Xin <
>>>>>>>>>>>>>>>> rxin@databricks.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I don't think we need to gate a 3.0 release on making a
>>>>>>>>>>>>>>>>> more stable version of InternalRow, but thinking we'd have dsv2 working in
>>>>>>>>>>>>>>>>> both 3.x (which will change and progress towards more stable, but will have
>>>>>>>>>>>>>>>>> to break certain APIs) and 2.x seems like a false premise.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> To point out some problems with InternalRow that you think
>>>>>>>>>>>>>>>>> are already pragmatic and stable:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> The class is in catalyst, which states:
>>>>>>>>>>>>>>>>> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/package.scala
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> /**
>>>>>>>>>>>>>>>>> * Catalyst is a library for manipulating relational query plans.  All classes in catalyst are
>>>>>>>>>>>>>>>>> * considered an internal API to Spark SQL and are subject to change between minor releases.
>>>>>>>>>>>>>>>>> */
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> There is no even any annotation on the interface.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> The entire dependency chain were created to be private,
>>>>>>>>>>>>>>>>> and tightly coupled with internal implementations. For example,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> https://github.com/apache/spark/blob/master/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> /**
>>>>>>>>>>>>>>>>> * A UTF-8 String for internal Spark use.
>>>>>>>>>>>>>>>>> * <p>
>>>>>>>>>>>>>>>>> * A String encoded in UTF-8 as an Array[Byte], which can be used for comparison,
>>>>>>>>>>>>>>>>> * search, see http://en.wikipedia.org/wiki/UTF-8 for details.
>>>>>>>>>>>>>>>>> * <p>
>>>>>>>>>>>>>>>>> * Note: This is not designed for general use cases, should not be used outside SQL.
>>>>>>>>>>>>>>>>> */
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ArrayData.scala
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> (which again is in catalyst package)
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> If you want to argue this way, you might as well argue we
>>>>>>>>>>>>>>>>> should make the entire catalyst package public to be pragmatic and not
>>>>>>>>>>>>>>>>> allow any changes.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Fri, Sep 20, 2019 at 11:32 AM, Ryan Blue <
>>>>>>>>>>>>>>>>> rblue@netflix.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> > When you created the PR to make InternalRow public
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> This isn’t quite accurate. The change I made was to use
>>>>>>>>>>>>>>>>>> InternalRow instead of UnsafeRow, which is a specific
>>>>>>>>>>>>>>>>>> implementation of InternalRow. Exposing this API has
>>>>>>>>>>>>>>>>>> always been a part of DSv2 and while both you and I did some work to avoid
>>>>>>>>>>>>>>>>>> this, we are still in the phase of starting with that API.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Note that any change to InternalRow would be very costly
>>>>>>>>>>>>>>>>>> to implement because this interface is widely used. That is why I think we
>>>>>>>>>>>>>>>>>> can certainly consider it stable enough to use here, and that’s probably
>>>>>>>>>>>>>>>>>> why UnsafeRow was part of the original proposal.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> In any case, the goal for 3.0 was not to replace the use
>>>>>>>>>>>>>>>>>> of InternalRow, it was to get the majority of SQL
>>>>>>>>>>>>>>>>>> working on top of the interface added after 2.4. That’s done and stable, so
>>>>>>>>>>>>>>>>>> I think a 2.5 release with it is also reasonable.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Fri, Sep 20, 2019 at 11:23 AM Reynold Xin <
>>>>>>>>>>>>>>>>>> rxin@databricks.com> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> To push back, while I agree we should not drastically
>>>>>>>>>>>>>>>>>>> change "InternalRow", there are a lot of changes that need to happen to
>>>>>>>>>>>>>>>>>>> make it stable. For example, none of the publicly exposed interfaces should
>>>>>>>>>>>>>>>>>>> be in the Catalyst package or the unsafe package. External implementations
>>>>>>>>>>>>>>>>>>> should be decoupled from the internal implementations, with cheap ways to
>>>>>>>>>>>>>>>>>>> convert back and forth.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> When you created the PR to make InternalRow public, the
>>>>>>>>>>>>>>>>>>> understanding was to work towards making it stable in the future, assuming
>>>>>>>>>>>>>>>>>>> we will start with an unstable API temporarily. You can't just make a bunch
>>>>>>>>>>>>>>>>>>> internal APIs tightly coupled with other internal pieces public and stable
>>>>>>>>>>>>>>>>>>> and call it a day, just because it happen to satisfy some use cases
>>>>>>>>>>>>>>>>>>> temporarily assuming the rest of Spark doesn't change.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Fri, Sep 20, 2019 at 11:19 AM, Ryan Blue <
>>>>>>>>>>>>>>>>>>> rblue@netflix.com> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> > DSv2 is far from stable right?
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> No, I think it is reasonably stable and very close to
>>>>>>>>>>>>>>>>>>>> being ready for a release.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> > All the actual data types are unstable and you guys
>>>>>>>>>>>>>>>>>>>> have completely ignored that.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I think what you're referring to is the use of
>>>>>>>>>>>>>>>>>>>> `InternalRow`. That's a stable API and there has been no work to avoid
>>>>>>>>>>>>>>>>>>>> using it. In any case, I don't think that anyone is suggesting that we
>>>>>>>>>>>>>>>>>>>> delay 3.0 until a replacement for `InternalRow` is added, right?
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> While I understand the motivation for a better solution
>>>>>>>>>>>>>>>>>>>> here, I think the pragmatic solution is to continue using `InternalRow`.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> > If the goal is to make DSv2 work across 3.x and 2.x,
>>>>>>>>>>>>>>>>>>>> that seems too invasive of a change to backport once you consider the parts
>>>>>>>>>>>>>>>>>>>> needed to make dsv2 stable.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I believe that those of us working on DSv2 are
>>>>>>>>>>>>>>>>>>>> confident about the current stability. We set goals for what to get into
>>>>>>>>>>>>>>>>>>>> the 3.0 release months ago and have very nearly reached the point where we
>>>>>>>>>>>>>>>>>>>> are ready for that release.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I don't think instability would be a problem in
>>>>>>>>>>>>>>>>>>>> maintaining compatibility between the 2.5 version and the 3.0 version. If
>>>>>>>>>>>>>>>>>>>> we find that we need to make API changes (other than additions) then we can
>>>>>>>>>>>>>>>>>>>> make those in the 3.1 release. Because the goals we set for the 3.0 release
>>>>>>>>>>>>>>>>>>>> have been reached with the current API and if we are ready to release 3.0,
>>>>>>>>>>>>>>>>>>>> we can release a 2.5 with the same API.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Fri, Sep 20, 2019 at 11:05 AM Reynold Xin <
>>>>>>>>>>>>>>>>>>>> rxin@databricks.com> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> DSv2 is far from stable right? All the actual data
>>>>>>>>>>>>>>>>>>>>> types are unstable and you guys have completely ignored that. We'd need to
>>>>>>>>>>>>>>>>>>>>> work on that and that will be a breaking change. If the goal is to make
>>>>>>>>>>>>>>>>>>>>> DSv2 work across 3.x and 2.x, that seems too invasive of a change to
>>>>>>>>>>>>>>>>>>>>> backport once you consider the parts needed to make dsv2 stable.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On Fri, Sep 20, 2019 at 10:47 AM, Ryan Blue <
>>>>>>>>>>>>>>>>>>>>> rblue@netflix.com.invalid> wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Hi everyone,
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> In the DSv2 sync this week, we talked about a
>>>>>>>>>>>>>>>>>>>>>> possible Spark 2.5 release based on the latest Spark 2.4, but with DSv2 and
>>>>>>>>>>>>>>>>>>>>>> Java 11 support added.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> A Spark 2.5 release with these two additions will
>>>>>>>>>>>>>>>>>>>>>> help people migrate to Spark 3.0 when it is released because they will be
>>>>>>>>>>>>>>>>>>>>>> able to use a single implementation for DSv2 sources that works in both 2.5
>>>>>>>>>>>>>>>>>>>>>> and 3.0. Similarly, upgrading to 3.0 won't also require updating to
>>>>>>>>>>>>>>>>>>>>>> Java 11 because users could update to Java 11 with the 2.5 release and have
>>>>>>>>>>>>>>>>>>>>>> fewer major changes.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Another reason to consider a 2.5 release is that many
>>>>>>>>>>>>>>>>>>>>>> people are interested in a release with the latest DSv2 API and support for
>>>>>>>>>>>>>>>>>>>>>> DSv2 SQL. I'm already going to be backporting DSv2 support to the Spark 2.4
>>>>>>>>>>>>>>>>>>>>>> line, so it makes sense to share this work with the community.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> This release line would just consist of backports
>>>>>>>>>>>>>>>>>>>>>> like DSv2 and Java 11 that assist compatibility, to keep the scope of the
>>>>>>>>>>>>>>>>>>>>>> release small. The purpose is to assist people moving to 3.0 and not
>>>>>>>>>>>>>>>>>>>>>> distract from the 3.0 release.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Would a Spark 2.5 release help anyone else? Are there
>>>>>>>>>>>>>>>>>>>>>> any concerns about this plan?
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> rb
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>>>>>> Ryan Blue
>>>>>>>>>>>>>>>>>>>>>> Software Engineer
>>>>>>>>>>>>>>>>>>>>>> Netflix
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>>>> Ryan Blue
>>>>>>>>>>>>>>>>>>>> Software Engineer
>>>>>>>>>>>>>>>>>>>> Netflix
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>> Ryan Blue
>>>>>>>>>>>>>>>>>> Software Engineer
>>>>>>>>>>>>>>>>>> Netflix
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>> Ryan Blue
>>>>>>>>>>>>>>>> Software Engineer
>>>>>>>>>>>>>>>> Netflix
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> --
>>>>>>>>>>>>>> Name : Jungtaek Lim
>>>>>>>>>>>>>> Blog : http://medium.com/@heartsavior
>>>>>>>>>>>>>> Twitter : http://twitter.com/heartsavior
>>>>>>>>>>>>>> LinkedIn : http://www.linkedin.com/in/heartsavior
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Ryan Blue
>>>>>>>>>>>>> Software Engineer
>>>>>>>>>>>>> Netflix
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Ryan Blue
>>>>>>>>>>>
>>>>>>>>>> --
>>>> Twitter: https://twitter.com/holdenkarau
>>>> Books (Learning Spark, High Performance Spark, etc.):
>>>> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
>>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>>
>>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>
>
> --
> Name : Jungtaek Lim
> Blog : http://medium.com/@heartsavior
> Twitter : http://twitter.com/heartsavior
> LinkedIn : http://www.linkedin.com/in/heartsavior
>


-- 
Ryan Blue
Software Engineer
Netflix
