spark-dev mailing list archives

From: Matei Zaharia
Subject: Re: Spark Improvement Proposals
Date: Fri, 07 Oct 2016 04:14:21 GMT

Hey Cody,

Thanks for bringing these things up. You're talking about quite a few different things here,
but let me get to them each in turn.

1) About technical / design discussion -- I fully agree that everything big should go through
a lot of review, and I like the idea of a more formal way to propose and comment on larger
features. So far, all of this has been done through JIRA, but as a start, maybe marking JIRAs
as large (we often use Umbrella for this) and also opening a thread on the list about each
such JIRA would help. For Structured Streaming in particular, FWIW, there was a pretty complete
doc on the proposed semantics at since March.
But it's true that other things such as the Kafka source for it didn't have as much design
on JIRA. Nonetheless, this component is still early on and there's still a lot of time to
change it, which is happening.

2) About what people say at Reactive Summit -- there will always be trolls, but just ignore
them and build a great project. Those of us involved in the project for a while have long
seen similar stuff, e.g. a prominent company saying Spark doesn't scale past 100 nodes when
there were many documented instances to the contrary, and the best answer is to just make
the project better. This same company, if you read their website now, recommends Apache Spark
for most anything. For streaming in particular, there is a lot of confusion because many of
the concepts aren't well-defined (e.g. what is "at least once", etc.), and it's also a crowded
space. But Spark Streaming prioritizes a few things that it does very well: correctness (you
can easily tell what the app will do, and it does the same thing despite failures), ease of
programming (which also requires correctness), and scalability. We should of course both explain
what it does in more places and work on improving it where needed (e.g. adding a higher level
API with Structured Streaming and built-in primitives for external timestamps).

3) About number and diversity of committers -- the PMC is always working to expand these,
and you should email people on the PMC (or even the whole list) if you have people you'd like
to propose. In general I think nearly all committers added in the past year were from organizations
that haven't long been involved in Spark, and the number of committers continues to grow pretty

4) Finally, about better organizing JIRA, marking dead issues, etc., this would be great and
I think we just need a concrete proposal for how to do it. It would be best to point to an
existing process that someone else has used here BTW so that we can see it in action.


> On Oct 6, 2016, at 7:51 PM, Cody Koeninger wrote:
> I love Spark.  3 or 4 years ago it was the first distributed computing
> environment that felt usable, and the community was welcoming.
> But I just got back from the Reactive Summit, and this is what I observed:
> - Industry leaders on stage making fun of Spark's streaming model
> - Open source project leaders saying they looked at Spark's governance
> as a model to avoid
> - Users saying they chose Flink because it was technically superior
> and they couldn't get any answers on the Spark mailing lists
> Whether you agree with the substance of any of this, when this stuff
> gets repeated enough people will believe it.
> Right now Spark is suffering from its own success, and I think
> something needs to change.
> - We need a clear process for planning significant changes to the codebase.
> I'm not saying you need to adopt Kafka Improvement Proposals exactly,
> but you need a documented process with a clear outcome (e.g. a vote).
> Passing around google docs after an implementation has largely been
> decided on doesn't cut it.
> - All technical communication needs to be public.
> Things getting decided in private chat, or when 1/3 of the committers
> work for the same company and can just talk to each other...
> Yes, it's convenient, but it's ultimately detrimental to the health of
> the project.
> The way structured streaming has played out has shown that there are
> significant technical blind spots (myself included).
> One way to address that is to get the people who have domain knowledge
> involved, and listen to them.
> - We need more committers, and more committer diversity.
> Per committer there are, what, more than 20 contributors and 10 new
> jira tickets a month?  It's too much.
> There are people (I am _not_ referring to myself) who have been around
> for years, contributed thousands of lines of code, helped educate the
> public around Spark... and yet are never going to be voted in.
> - We need a clear process for managing volunteer work.
> Too many tickets sit around unowned, unclosed, uncertain.
> If someone proposed something and it isn't up to snuff, tell them and
> close it.  It may be blunt, but it's clearer than "silent no".
> If someone wants to work on something, let them own the ticket and set
> a deadline. If they don't meet it, close it or reassign it.
> This is not me putting on an Apache Bureaucracy hat.  This is me
> saying, as a fellow hacker and loyal dissenter, something is wrong
> with the culture and process.
> Please, let's change it.
> ---------------------------------------------------------------------
> To unsubscribe e-mail:
