spark-dev mailing list archives

From Cody Koeninger <c...@koeninger.org>
Subject Re: Spark Improvement Proposals
Date Sun, 09 Oct 2016 22:20:56 GMT
Yeah, I've looked at KIPs and Scala SIPs.

I'm reluctant to use the Kafka structured streaming as an example
because of the pre-existing conflict around it.  If Michael or another
committer wanted to put it forth as an example, I'd participate in
good faith though.

On Sun, Oct 9, 2016 at 5:07 PM, Ofir Manor <ofir.manor@equalum.io> wrote:
> This is a great discussion!
> Maybe you could have a look at Kafka's process - it also uses Rejected
> Alternatives and I personally find it very clear actually (the link also
> leads to all KIPs):
>
> https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals
> Cody - maybe you could take one of the open issues and write a sample
> proposal? A concrete example might make it clearer for those who see this
> for the first time. Maybe the Kafka offset discussion or some other
> Kafka/Structured Streaming open issue? Will that be helpful?
>
> Ofir Manor
>
> Co-Founder & CTO | Equalum
>
> Mobile: +972-54-7801286 | Email: ofir.manor@equalum.io
>
>
> On Mon, Oct 10, 2016 at 12:36 AM, Matei Zaharia <matei.zaharia@gmail.com>
> wrote:
>>
>> Yup, this is the stuff that I found unclear. Thanks for clarifying here,
>> but we should also clarify it in the writeup. In particular:
>>
>> - Goals needs to be about user-facing behavior ("people" is broad)
>>
>> - I'd rename Rejected Goals to Non-Goals. Otherwise someone will dig up
>> one of these and say "Spark's developers have officially rejected X, which
>> our awesome system has".
>>
>> - For user-facing stuff, I think you need a section on API. Virtually all
>> other *IPs I've seen have that.
>>
>> - I'm still not sure why the strategy section is needed if the purpose is
>> to define user-facing behavior -- unless this is the strategy for setting
>> the goals or for defining the API. That sounds squarely like a design doc
>> issue. In some sense, who cares whether the proposal is technically feasible
>> right now? If it's infeasible, that will be discovered later during design
>> and implementation. Same thing with rejected strategies -- listing some of
>> those is definitely useful sometimes, but if you make this a *required*
>> section, people are just going to fill it in with bogus stuff (I've seen
>> this happen before).
>>
>> Matei
>>
>> > On Oct 9, 2016, at 2:14 PM, Cody Koeninger <cody@koeninger.org> wrote:
>> >
>> > So to focus the discussion on the specific strategy I'm suggesting,
>> > documented at
>> >
>> >
>> > https://github.com/koeninger/spark-1/blob/SIP-0/docs/spark-improvement-proposals.md
>> >
>> > "Goals: What must this allow people to do, that they can't currently?"
>> >
>> > Is it unclear that this is focusing specifically on people-visible
>> > behavior?
>> >
>> > Rejected goals -  are important because otherwise people keep trying
>> > to argue about scope.  Of course you can change things later with a
>> > different SIP and different vote, the point is to focus.
>> >
>> > Use cases - are something that people are going to bring up in
>> > discussion.  If they aren't clearly documented as a goal ("This must
>> > allow me to connect using SSL"), they should be added.
>> >
>> > Internal architecture - if the people who need specific behavior are
>> > implementers of other parts of the system, that's fine.
>> >
>> > Rejected strategies - If you have none of these, you have no evidence
>> > that the proponent didn't just go with the first thing they had in
>> > mind (or have already implemented), which is a big problem currently.
>> > Approval isn't binding as to specifics of implementation, so these
>> > aren't handcuffs.  The goals are the contract, the strategy is
>> > evidence that contract can actually be met.
>> >
>> > Design docs - I'm not touching design docs.  The markdown file I
>> > linked specifically says of the strategy section "This is not a full
>> > design document."  Is this unclear?  Design docs can be worked on
>> > obviously, but that's not what I'm concerned with here.
>> >
>> >
>> >
>> >
>> > On Sun, Oct 9, 2016 at 2:34 PM, Matei Zaharia <matei.zaharia@gmail.com>
>> > wrote:
>> >> Hi Cody,
>> >>
>> >> I think this would be a lot more concrete if we had a more detailed
>> >> template
>> >> for SIPs. Right now, it's not super clear what's in scope -- e.g. are
>> >> they
>> >> a way to solicit feedback on the user-facing behavior or on the
>> >> internals?
>> >> "Goals" can cover both things. I've been thinking of SIPs more as
>> >> Product
>> >> Requirements Docs (PRDs), which focus on *what* a code change should do
>> >> as
>> >> opposed to how.
>> >>
>> >> In particular, here are some things that you may or may not consider in
>> >> scope for SIPs:
>> >>
>> >> - Goals and non-goals: This is definitely in scope, and IMO should
>> >> focus on
>> >> user-visible behavior (e.g. "system supports SQL window functions" or
>> >> "system continues working if one node fails"). BTW I wouldn't say
>> >> "rejected
>> >> goals" because some of them might become goals later, so we're not
>> >> definitively rejecting them.
>> >>
>> >> - Public API: Probably should be included in most SIPs unless it's too
>> >> large
>> >> to fully specify then (e.g. "let's add an ML library").
>> >>
>> >> - Use cases: I usually find this very useful in PRDs to better
>> >> communicate
>> >> the goals.
>> >>
>> >> - Internal architecture: This is usually *not* a thing users can easily
>> >> comment on and it sounds more like a design doc item. Of course it's
>> >> important to show that the SIP is feasible to implement. One exception,
>> >> however, is that I think we'll have some SIPs primarily on internals
>> >> (e.g.
>> >> if somebody wants to refactor Spark's query optimizer or something).
>> >>
>> >> - Rejected strategies: I personally wouldn't put this, because what's
>> >> the
>> >> point of voting to reject a strategy before you've really begun
>> >> designing
>> >> and implementing something? What if you discover that the strategy is
>> >> actually better when you start doing stuff?
>> >>
>> >> At a super high level, it depends on whether you want the SIPs to be
>> >> PRDs
>> >> for getting some quick feedback on the goals of a feature before it is
>> >> designed, or something more like full-fledged design docs (just a more
>> >> visible design doc for bigger changes). I looked at Kafka's KIPs, and
>> >> they
>> >> actually seem to be more like design docs. This can work too but it
>> >> does
>> >> require more work from the proposer and it can lead to the same
>> >> problems you
>> >> mentioned with people already having a design and implementation in
>> >> mind.
>> >>
>> >> Basically, the question is, are you trying to iterate faster on design
>> >> by
>> >> adding a step for user feedback earlier? Or are you just trying to make
>> >> design docs for key features more visible (and their approval more
>> >> formal)?
>> >>
>> >> BTW note that in either case, I'd like to have a template for design
>> >> docs
>> >> too, which should also include goals. I think that would've avoided
>> >> some of
>> >> the issues you brought up.
>> >>
>> >> Matei
>> >>
>> >> On Oct 9, 2016, at 10:40 AM, Cody Koeninger <cody@koeninger.org> wrote:
>> >>
>> >> Here's my specific proposal (meta-proposal?)
>> >>
>> >> Spark Improvement Proposals (SIP)
>> >>
>> >>
>> >> Background:
>> >>
>> >> The current problem is that design and implementation of large features
>> >> are
>> >> often done in private, before soliciting user feedback.
>> >>
>> >> When feedback is solicited, it is often as to detailed design
>> >> specifics, not
>> >> focused on goals.
>> >>
>> >> When implementation does take place after design, there is often
>> >> disagreement as to what goals are or are not in scope.
>> >>
>> >> This results in commits that don't fully meet user needs.
>> >>
>> >>
>> >> Goals:
>> >>
>> >> - Ensure user, contributor, and committer goals are clearly identified
>> >> and
>> >> agreed upon, before implementation takes place.
>> >>
>> >> - Ensure that a technically feasible strategy is chosen that is likely
>> >> to
>> >> meet the goals.
>> >>
>> >>
>> >> Rejected Goals:
>> >>
>> >> - SIPs are not for detailed design.  Design by committee doesn't work.
>> >>
>> >> - SIPs are not for every change.  We don't need that much process.
>> >>
>> >>
>> >> Strategy:
>> >>
>> >> My suggestion is outlined as a Spark Improvement Proposal process
>> >> documented
>> >> at
>> >>
>> >>
>> >> https://github.com/koeninger/spark-1/blob/SIP-0/docs/spark-improvement-proposals.md
>> >>
>> >> Specifics of Jira manipulation are an implementation detail we can
>> >> figure
>> >> out.
>> >>
>> >> I'm suggesting voting; the need here is for a _clear_ outcome.
>> >>
>> >>
>> >> Rejected Strategies:
>> >>
>> >> Having someone who understands the problem implement it first works,
>> >> but
>> >> only if significant iteration after user feedback is allowed.
>> >>
>> >> Historically this has been problematic due to pressure to limit public
>> >> api
>> >> changes.
>> >>
>> >>
>> >> On Fri, Oct 7, 2016 at 5:16 PM, Reynold Xin <rxin@databricks.com>
>> >> wrote:
>> >>>
>> >>> Alright, looks like there is quite a bit of support. We should wait to
>> >>> hear from more people too.
>> >>>
>> >>> To push this forward, Cody and I will be working together in the next
>> >>> couple of weeks to come up with a concrete, detailed proposal on what
>> >>> this entails, and then we can discuss the specific proposal as well.
>> >>>
>> >>>
>> >>> On Fri, Oct 7, 2016 at 2:29 PM, Cody Koeninger <cody@koeninger.org>
>> >>> wrote:
>> >>>>
>> >>>> Yeah, in case it wasn't clear, I was talking about SIPs for major
>> >>>> user-facing or cross-cutting changes, not minor feature adds.
>> >>>>
>> >>>> On Fri, Oct 7, 2016 at 3:58 PM, Stavros Kontopoulos
>> >>>> <stavros.kontopoulos@lightbend.com> wrote:
>> >>>>>
>> >>>>> +1 to the SIP label as long as it does not slow things down and it
>> >>>>> targets optimizing efforts, coordination, etc. For example, really
>> >>>>> small features (assuming they don't touch public interfaces) or
>> >>>>> refactorings should not need to go through this process, and I hope
>> >>>>> it will be kept this way. So a guideline doc should be provided, as
>> >>>>> in the KIP case.
>> >>>>>
>> >>>>> IMHO so far aside from tagging things and linking them elsewhere
>> >>>>> simply having design docs and prototype implementations in PRs is
>> >>>>> not something that has not worked so far. What is really a pain in
>> >>>>> many projects out there is discontinuity in progress of PRs, missing
>> >>>>> features, slow reviews, which is understandable to some extent... it
>> >>>>> is not only about Spark, but things can be improved for sure for
>> >>>>> this project in particular, as already stated.
>> >>>>>
>> >>>>> On Fri, Oct 7, 2016 at 11:14 PM, Cody Koeninger <cody@koeninger.org>
>> >>>>> wrote:
>> >>>>>>
>> >>>>>> +1 to adding an SIP label and linking it from the website.  I think
>> >>>>>> it needs
>> >>>>>>
>> >>>>>> - template that focuses it towards soliciting user goals / non-goals
>> >>>>>> - clear resolution as to which strategy was chosen to pursue.  I'd
>> >>>>>> recommend a vote.
>> >>>>>>
>> >>>>>> Matei asked me to clarify what I meant by changing interfaces. I
>> >>>>>> think it's directly relevant to the SIP idea so I'll clarify here,
>> >>>>>> and split a thread for the other discussion per Nicholas' request.
>> >>>>>>
>> >>>>>> I meant changing public user interfaces.  I think the first design
>> >>>>>> is unlikely to be right, because it's done at a time when you have
>> >>>>>> the least information.  As a user, I find it considerably more
>> >>>>>> frustrating to be unable to use a tool to get my job done, than I do
>> >>>>>> having to make minor changes to my code in order to take advantage
>> >>>>>> of features.  I've seen committers be seriously reluctant to allow
>> >>>>>> changes to @experimental code that are needed in order for it to
>> >>>>>> really work right.  You need to be able to iterate, and if people on
>> >>>>>> both sides of the fence aren't going to respect that some newer apis
>> >>>>>> are subject to change, then why even mark them as such?
>> >>>>>>
>> >>>>>> Ideally a finished SIP should give me a checklist of things that an
>> >>>>>> implementation must do, and things that it doesn't need to do.
>> >>>>>> Contributors/committers should be seriously discouraged from putting
>> >>>>>> out a version 0.1 that doesn't have at least a prototype
>> >>>>>> implementation of all those things, especially if they're then going
>> >>>>>> to argue against interface changes necessary to get the rest of the
>> >>>>>> things done in the 0.2 version.
>> >>>>>>
>> >>>>>>
>> >>>>>> On Fri, Oct 7, 2016 at 2:18 PM, Reynold Xin <rxin@databricks.com>
>> >>>>>> wrote:
>> >>>>>>> I like the lightweight proposal to add a SIP label.
>> >>>>>>>
>> >>>>>>> During Spark 2.0 development, Tom (Graves) and I suggested using
>> >>>>>>> the wiki to track the list of major changes, but that never really
>> >>>>>>> materialized due to the overhead. Adding a SIP label on major JIRAs
>> >>>>>>> and then linking to them prominently on the Spark website makes a
>> >>>>>>> lot of sense.
>> >>>>>>>
>> >>>>>>>
>> >>>>>>> On Fri, Oct 7, 2016 at 10:50 AM, Matei Zaharia
>> >>>>>>> <matei.zaharia@gmail.com>
>> >>>>>>> wrote:
>> >>>>>>>>
>> >>>>>>>> For the improvement proposals, I think one major point was to make
>> >>>>>>>> them really visible to users who are not contributors, so we should
>> >>>>>>>> do more than sending stuff to dev@. One very lightweight idea is to
>> >>>>>>>> have a new type of JIRA called a SIP and have a link to a filter
>> >>>>>>>> that shows all such JIRAs from http://spark.apache.org. I also like
>> >>>>>>>> the idea of SIP and design doc templates (in fact many projects
>> >>>>>>>> have them).
>> >>>>>>>>
>> >>>>>>>> Matei
>> >>>>>>>>
>> >>>>>>>> On Oct 7, 2016, at 10:38 AM, Reynold Xin <rxin@databricks.com>
>> >>>>>>>> wrote:
>> >>>>>>>>
>> >>>>>>>> I called Cody last night and talked about some of the topics in his
>> >>>>>>>> email. It became clear to me Cody genuinely cares about the
>> >>>>>>>> project.
>> >>>>>>>>
>> >>>>>>>> Some of the frustrations come from the success of the project
>> >>>>>>>> itself becoming very "hot", and it is difficult to get clarity from
>> >>>>>>>> people who don't dedicate all their time to Spark. In fact, it is
>> >>>>>>>> in some ways similar to scaling an engineering team in a successful
>> >>>>>>>> startup: old processes that worked well might not work so well when
>> >>>>>>>> it gets to a certain size, cultures can get diluted, building
>> >>>>>>>> culture vs building process, etc.
>> >>>>>>>>
>> >>>>>>>> I would also really like to have a more visible process for larger
>> >>>>>>>> changes, especially major user-facing API changes. Historically we
>> >>>>>>>> upload design docs for major changes, but it is not always
>> >>>>>>>> consistent and it is difficult to ensure the quality of the docs,
>> >>>>>>>> due to the volunteering nature of the organization.
>> >>>>>>>>
>> >>>>>>>> Some of the more concrete ideas we discussed focus on building a
>> >>>>>>>> culture to improve clarity:
>> >>>>>>>>
>> >>>>>>>> - Process: Large changes should have design docs posted on JIRA.
>> >>>>>>>> One thing Cody and I didn't discuss, but an idea that just came to
>> >>>>>>>> me, is that we should create a design doc template for the project
>> >>>>>>>> and ask everybody to follow it. The design doc template should also
>> >>>>>>>> explicitly list goals and non-goals, to make design docs more
>> >>>>>>>> consistent.
>> >>>>>>>>
>> >>>>>>>> - Process: Email dev@ to solicit feedback. We have done this with
>> >>>>>>>> some changes, but again very inconsistently. Just posting something
>> >>>>>>>> on JIRA isn't sufficient, because there are simply too many JIRAs
>> >>>>>>>> and the signal gets lost in the noise. While this is generally
>> >>>>>>>> impossible to enforce because we can't force all volunteers to
>> >>>>>>>> conform to a process (or they might not even be aware of this),
>> >>>>>>>> those who are more familiar with the project can help by emailing
>> >>>>>>>> dev@ when they see something that hasn't been.
>> >>>>>>>>
>> >>>>>>>> - Culture: The design doc author(s) should be open to feedback. A
>> >>>>>>>> design doc should serve as the base for discussion and is by no
>> >>>>>>>> means the final design. Of course, this does not mean the author
>> >>>>>>>> has to accept every piece of feedback. They should also be
>> >>>>>>>> comfortable accepting / rejecting ideas on technical grounds.
>> >>>>>>>>
>> >>>>>>>> - Process / Culture: For major ongoing projects, it can be useful
>> >>>>>>>> to have some monthly Google hangouts that are open to the world. I
>> >>>>>>>> am actually not sure how well this will work, because of the
>> >>>>>>>> volunteering nature and we need to adjust for timezones for people
>> >>>>>>>> across the globe, but it seems worth trying.
>> >>>>>>>>
>> >>>>>>>> - Culture: Contributors (including committers) should be more
>> >>>>>>>> direct in setting expectations, including whether they are working
>> >>>>>>>> on a specific issue, whether they will be working on a specific
>> >>>>>>>> issue, and whether an issue or pr or jira should be rejected. Most
>> >>>>>>>> people I know in this community are nice and don't enjoy telling
>> >>>>>>>> other people no, but it is often more annoying to a contributor to
>> >>>>>>>> not know anything than getting a no.
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>> On Fri, Oct 7, 2016 at 10:03 AM, Matei Zaharia
>> >>>>>>>> <matei.zaharia@gmail.com>
>> >>>>>>>> wrote:
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>> Love the idea of a more visible "Spark Improvement Proposal"
>> >>>>>>>>> process that solicits user input on new APIs. For what it's worth,
>> >>>>>>>>> I don't think committers are trying to minimize their own work --
>> >>>>>>>>> every committer cares about making the software useful for users.
>> >>>>>>>>> However, it is always hard to get user input and so it helps to
>> >>>>>>>>> have this kind of process. I've certainly looked at the *IPs a lot
>> >>>>>>>>> in other software I use just to see the biggest things on the
>> >>>>>>>>> roadmap.
>> >>>>>>>>>
>> >>>>>>>>> When you're talking about "changing interfaces", are you talking
>> >>>>>>>>> about public or internal APIs? I do think many people hate
>> >>>>>>>>> changing public APIs and I actually think that's for the best of
>> >>>>>>>>> the project. That's a technical debate, but basically, the worst
>> >>>>>>>>> thing when you're using a piece of software is that the developers
>> >>>>>>>>> constantly ask you to rewrite your app to update to a new version
>> >>>>>>>>> (and thus benefit from bug fixes, etc). Cue anyone who's used
>> >>>>>>>>> Protobuf, or Guava. The "let's get everyone to change their code
>> >>>>>>>>> this release" model works well within a single large company, but
>> >>>>>>>>> doesn't work well for a community, which is why nearly all *very*
>> >>>>>>>>> widely used programming interfaces (I'm talking things like Java
>> >>>>>>>>> standard library, Windows API, etc) almost *never* break backwards
>> >>>>>>>>> compatibility. All this is done within reason though, e.g. we do
>> >>>>>>>>> change things in major releases (2.x, 3.x, etc).
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>> ---------------------------------------------------------------------
>> >>>>>> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>> >>>>>>
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>> --
>> >>>>> Stavros Kontopoulos
>> >>>>> Senior Software Engineer
>> >>>>> Lightbend, Inc.
>> >>>>> p:  +30 6977967274
>> >>>>> e: stavros.kontopoulos@lightbend.com
>> >>>>>
>> >>>>>
>> >>>>
>> >>>
>> >>
>> >>
>>
>>
>>
>


