spark-dev mailing list archives

From Xiao Li <>
Subject Re: Spark Improvement Proposals
Date Fri, 07 Oct 2016 04:53:16 GMT
Let us continue to improve Apache Spark!

I volunteer to go through all the SQL-related open JIRAs.

Xiao Li

2016-10-06 21:14 GMT-07:00 Matei Zaharia <>:
> Hey Cody,
> Thanks for bringing these things up. You're talking about quite a few different things
here, but let me get to them each in turn.
> 1) About technical / design discussion -- I fully agree that everything big should go
through a lot of review, and I like the idea of a more formal way to propose and comment on
larger features. So far, all of this has been done through JIRA, but as a start, maybe marking
JIRAs as large (we often use Umbrella for this) and also opening a thread on the list about
each such JIRA would help. For Structured Streaming in particular, FWIW, there was a pretty
complete doc on the proposed semantics at
since March. But it's true that other things such as the Kafka source for it didn't have as
much design on JIRA. Nonetheless, this component is still early on and there's still a lot
of time to change it, which is happening.
> 2) About what people say at Reactive Summit -- there will always be trolls, but just
ignore them and build a great project. Those of us involved in the project for a while have
long seen similar stuff, e.g. a prominent company saying Spark doesn't scale past 100 nodes
when there were many documented instances to the contrary, and the best answer is to just
make the project better. This same company, if you read their website now, recommends Apache
Spark for most anything. For streaming in particular, there is a lot of confusion because
many of the concepts aren't well-defined (e.g. what is "at least once", etc.), and it's also
a crowded space. But Spark Streaming prioritizes a few things that it does very well: correctness
(you can easily tell what the app will do, and it does the same thing despite failures), ease
of programming (which also requires correctness), and scalability. We should of course both
explain what it does in more places and work on improving it where needed (e.g. adding a higher
level API with Structured Streaming and built-in primitives for external timestamps).
> 3) About number and diversity of committers -- the PMC is always working to expand these,
and you should email people on the PMC (or even the whole list) if you have people you'd like
to propose. In general I think nearly all committers added in the past year were from organizations
that haven't long been involved in Spark, and the number of committers continues to grow pretty
> 4) Finally, about better organizing JIRA, marking dead issues, etc., this would be great
and I think we just need a concrete proposal for how to do it. It would be best to point to
an existing process that someone else has used here BTW so that we can see it in action.
> Matei
>> On Oct 6, 2016, at 7:51 PM, Cody Koeninger <> wrote:
>> I love Spark.  3 or 4 years ago it was the first distributed computing
>> environment that felt usable, and the community was welcoming.
>> But I just got back from the Reactive Summit, and this is what I observed:
>> - Industry leaders on stage making fun of Spark's streaming model
>> - Open source project leaders saying they looked at Spark's governance
>> as a model to avoid
>> - Users saying they chose Flink because it was technically superior
>> and they couldn't get any answers on the Spark mailing lists
>> Whether you agree with the substance of any of this, when this stuff
>> gets repeated enough people will believe it.
>> Right now Spark is suffering from its own success, and I think
>> something needs to change.
>> - We need a clear process for planning significant changes to the codebase.
>> I'm not saying you need to adopt Kafka Improvement Proposals exactly,
>> but you need a documented process with a clear outcome (e.g. a vote).
>> Passing around Google Docs after an implementation has largely been
>> decided on doesn't cut it.
>> - All technical communication needs to be public.
>> Things getting decided in private chat, or when 1/3 of the committers
>> work for the same company and can just talk to each other...
>> Yes, it's convenient, but it's ultimately detrimental to the health of
>> the project.
>> The way Structured Streaming has played out has shown that there are
>> significant technical blind spots (myself included).
>> One way to address that is to get the people who have domain knowledge
>> involved, and listen to them.
>> - We need more committers, and more committer diversity.
>> Per committer there are, what, more than 20 contributors and 10 new
>> JIRA tickets a month?  It's too much.
>> There are people (I am _not_ referring to myself) who have been around
>> for years, contributed thousands of lines of code, helped educate the
>> public around Spark... and yet are never going to be voted in.
>> - We need a clear process for managing volunteer work.
>> Too many tickets sit around unowned, unclosed, uncertain.
>> If someone proposed something and it isn't up to snuff, tell them and
>> close it.  It may be blunt, but it's clearer than "silent no".
>> If someone wants to work on something, let them own the ticket and set
>> a deadline. If they don't meet it, close it or reassign it.
>> This is not me putting on an Apache Bureaucracy hat.  This is me
>> saying, as a fellow hacker and loyal dissenter, something is wrong
>> with the culture and process.
>> Please, let's change it.
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: