spark-dev mailing list archives

From Holden Karau <>
Subject Re: Spark Improvement Proposals
Date Fri, 07 Oct 2016 16:16:38 GMT
First off, thanks Cody for taking the time to put together these proposals
- I think it has kicked off some wonderful discussion.

I think dismissing people's complaints about Spark as largely trolling does us
a disservice. It’s important for us to recognize our own shortcomings;
otherwise we are blind to the weak spots where we need to improve and
instead focus on new features. Parts of the Python community seem to be
actively looking for alternatives, and I’d obviously like Spark to continue to
be the place where we come together and collaborate from different

I’d be more than happy to do a review of the outstanding Python PRs (I’ve
been keeping on top of the new ones but largely haven’t looked at the older
ones) and if there is a committer (maybe Davies or Sean?) who would be able
to help out with merging them once they are ready that would be awesome.
I’m at PyData DC this weekend but I’ll also start going through some of the
older Python JIRAs and seeing if they are still relevant, already fixed, or
something we are unlikely to be interested in bringing into Spark.

I’m giving a talk later on this month on how to get started contributing to
Apache Spark at OSCON London, and when I’ve given this talk before I’ve had
to include a fair number of warnings about the challenges that can face a
new contributor. I’d love to be able to drop those in future versions :)


As one of the non-committers who has been working on Spark for several
years (see ) I have strong feelings around the current
process being used for committers - but since I’m not on the PMC (catch-22
style) it's difficult to have any visibility into the process, so someone
who does will have to weigh in on that :)

On Fri, Oct 7, 2016 at 8:00 AM, Cody Koeninger <> wrote:

> Sean, that was very eloquently put, and I 100% agree.  If I ever meet
> you in person, I'll buy you multiple rounds of beverages of your
> choice ;)
> This is probably reiterating some of what you said in a less clear
> manner, but I'll throw more of my 2 cents in.
> - Design.
> Yes, design by committee doesn't work.  The best designs are when a
> person who understands the problem builds something that works for
> them, shares with others, and most importantly iterates when it
> doesn't work for others.  This iteration only works if you're willing
> to change interfaces, but committer and user goals are not aligned
> here.  Users want something that is clearly documented and helps them
> get their job done.  Committers (not all) want to minimize interface
> change, even at the expense of users being able to do their jobs.  In
> this situation, it is critical that you understand early what users
> need to be able to do.  This is what the improvement proposal process
> should focus on: Goals, non-goals, possible solutions, rejected
> solutions.  Not class-level design.  Most importantly, it needs a
> clear, unambiguous outcome that is visible to the public.
> - Trolling
> It's not just trolling.  Event time and kafka are technically
> important and should not be ignored.  I've been banging this drum for
> years.  These concerns haven't been fully heard and understood by
> committers.  This is one example of why diversity of enfranchised users
> is important and governance concerns shouldn't be ignored.
> - Jira
> Concretely, automate closing stale jiras after X amount of time.  It's
> really surprising to me how much reluctance a community of programmers
> has shown towards automating their own processes around stuff like
> this (not to mention automatic code formatting of modified files).  I
> understand the arguments against, but the current alternative doesn't
> work.
> Concretely, clearly reject and close jiras.  I have a backlog of 50+
> kafka jiras, many of which are irrelevant at this point, but I do not
> feel that I have the political power to close them.
> Concretely, make it clear who is working on something.  This can be as
> simple as saying "I'm working on this, assign it to me", and if I don't
> follow up in X amount of time, close it or reassign.  That doesn't
> mean there can't be competing work, but it does mean those people
> should talk to each other.  Conversely, if committers currently don't
> have time to work on something that is important, make that clear in
> the ticket.
> On Fri, Oct 7, 2016 at 5:34 AM, Sean Owen <> wrote:
> > Suggested actions way at the bottom.
> >
> > On Fri, Oct 7, 2016 at 5:14 AM Matei Zaharia <>
> > wrote:
> >>
> >> since March. But it's true that other things such as the Kafka source for
> >> it didn't have as much design on JIRA. Nonetheless, this component is still
> >> early on and there's still a lot of time to change it, which is happening.
> >
> >
> > It's hard to drive design discussions in OSS. Even when diligently
> > publishing design docs, the doc happens after brainstorming, and that
> > happens inside someone's head or in chats.
> >
> > The lazy consensus model that works for small changes doesn't work well
> > here. If a committer wants a change, that change will basically be made
> > modulo small edits; vetoes are for dire disagreement. (Otherwise we'd get
> > nothing done.) However this model means it's hard to significantly change a
> > design after draft 1.
> >
> > I've heard this complaint a few times, and it has never been down to bad
> > faith. We should err further towards over-including early and often. I've
> > seen some great discussions start more with a problem statement and an RFC,
> > not a design doc. Keeping regular contributors enfranchised is essential, so
> > that they're willing and able to participate when design time comes. (See
> > below.)
> >
> >
> >>
> >> 2) About what people say at Reactive Summit -- there will always be
> >> trolls, but just ignore them and build a great project. Those of us involved
> >> in the project for a while have long seen similar stuff, e.g. a
> >
> >
> > The hype cycle may be turning against Spark, as is normal for this stage of
> > maturity. People idealize technologies they don't really use as greener
> > grass; it's the things they use and need to work that they love to hate.
> >
> > I would not dismiss this as just trolling. Customer anecdotes I see suggest
> > that Spark underperforms their (inflated) expectations, and generally does
> > not Just Work. It takes expertise, tuning, patience, workarounds. And then
> > it gets great things done. I do see a gap between how the group here talks
> > about the technology, and how the users I see talk about it. The gap
> > manifests in attention given to making yet more things, and attention given
> > to fixing and project mechanics.
> >
> > I would also not dismiss criticism of governance. We can recognize some big
> > problems that were resolved over even the past 3 months. Usually I hear,
> > well, we do better than most projects, right? and that is true. But, Spark
> > is bigger and busier than most any other project. Exceptional projects need
> > exceptional governance and we have merely "good". See next.
> >
> >
> >> 3) About number and diversity of committers -- the PMC is always working
> >> to expand these, and you should email people on the PMC (or even the whole
> >> list) if you have people you'd like to propose. In
> >
> >
> > If you're suggesting that it's mostly a matter of asking, then this doesn't
> > match my experience. I have seen a few people consistently soft-reject most
> > proposals. The reasons given usually sound like "concerns about quality",
> > which is probably the right answer to a somewhat wrong question.
> >
> > We should probably be asking primarily who will net-net add efficiency to
> > some part of the project's mechanics. Per above, it wouldn't hurt to ask who
> > would expand coverage and add diversity of perspective too.
> >
> > I disagree that committers are being added at a sufficient rate. The overall
> > committer-attention hours is dropping as the project grows -- am I the only
> > one that perceives many regular committers aren't working nearly as much as
> > before on the project?
> >
> > I call it a problem because we have IMHO people who 'qualify', and not
> > giving them some stake is going to cost the project down the road. Always Be
> > Recruiting. This is what I would worry about, since the governance and
> > enfranchisement issues above kind of stem from this.
> >
> >
> >>
> >> 4) Finally, about better organizing JIRA, marking dead issues, etc, this
> >> would be great and I think we just need a concrete proposal for how to do
> >> it. It would be best to point to an existing process that someone else has
> >> used here BTW so that we can see it in action.
> >
> >
> > I don't think we're wanting for proposals. I went on and on about it last
> > year, and don't think anyone disagreed about actions. I wouldn't suggest
> > that clearing out dead issues is more complex than just putting in time to
> > do it. It's just grunt work and understandably not appealing. (Thank you
> > Xiao for your recent run at SQL JIRAs.)
> >
> > It requires saying 'no', which is hard, because it requires some conviction.
> > I have encountered reluctance to do this in Spark and think that culture
> > should change. Is it weird to say that a broader group of gatekeepers can
> > actually with more confidence and efficiency tackle the triage issue? that
> > pushing back on 'bad' contribution actually increases the rate of 'good'?
> >
> > FWIW I also find the project unpleasant to deal with day to day, mostly
> > because of the scale of the triage, and think we could use all the qualified
> > help we can get. I am looking to do less with the project over time, which
> > is no big deal in itself, but is a big deal if these several factors are
> > adding up to discourage fresh blood from joining the fray. Cody makes me
> > think there are, at least, 2 of us.
> >
> > Concrete steps?
> >
> > Go to Look at "Users". Look at your open PRs. Are any stale?
> > can you close them or advance them?
> >
> > Look at the Stale PRs tab and sort by last updated. Do any look dead? can
> > you ask the author to update or close? does the parent JIRA look like it's
> > not otherwise relevant?
> >
> > Go download JIRA Client at
> > Go look at all open JIRAs sorted by last update. Are any pretty obviously
> > obsolete?
> >
> > If you don't feel comfortable acting, feel free to at least propose a list
> > to dev@ for a look.

Cell : 425-233-8271
