spark-dev mailing list archives

From Matei Zaharia <matei.zaha...@gmail.com>
Subject [VOTE] Designating maintainers for some Spark components
Date Thu, 06 Nov 2014 01:31:58 GMT
Hi all,

I wanted to share a discussion we've been having on the PMC list, as well as call for an official
vote on it on a public list. Basically, as the Spark project scales up, we need to define
a model to make sure there is still great oversight of key components (in particular internal
architecture and public APIs), and to this end I've proposed implementing a maintainer model
for some of these components, similar to other large projects.

As background on this, Spark has grown a lot since joining Apache. We've had over 80 contributors/month
for the past 3 months, which I believe makes us the most active Apache project by that measure,
along with over 500 patches/month. The codebase has also grown significantly, with
new libraries for SQL, ML, graphs and more.

In this kind of large project, one common way to scale development is to assign "maintainers"
to oversee key components, where each patch to that component needs to get sign-off from at
least one of its maintainers. Most existing large projects do this -- at Apache, some large
ones with this model are CloudStack (the second-most active project overall), Subversion,
and Kafka, and other examples include Linux and Python. This is also by and large how Spark
operates today -- most components have a de facto maintainer.

IMO, adopting this model would have two benefits:

1) Consistent oversight of design for that component, especially regarding architecture and
API. This process would ensure that the component's maintainers see all proposed changes and
can check that they fit together well.

2) More structure for new contributors and committers -- in particular, it would be easy to
look up who’s responsible for each module and ask them for reviews, etc., rather than having
patches slip through the cracks.

We'd like to start in a lightweight manner, where the model only applies to certain
key components (e.g. scheduler, shuffle) and user-facing APIs (MLlib, GraphX, etc.). Over time,
as the project grows, we can expand it if we deem it useful. The specific mechanics would
be as follows:

- Some components in Spark will have maintainers assigned to them, where one of the maintainers
needs to sign off on each patch to the component.
- Each component with maintainers will have at least 2 maintainers.
- Maintainers will be assigned from the most active and knowledgeable committers on that component
by the PMC. The PMC can vote to add / remove maintainers, and maintained components, through
consensus.
- Maintainers are expected to be active in responding to patches for their components, though
they do not need to be the main reviewers for them (e.g. they might just sign off on architecture
/ API). To prevent inactive maintainers from blocking the project, if a maintainer isn't responding
in a reasonable time period (say 2 weeks), other committers can merge the patch, and the PMC
will want to discuss adding another maintainer.

If you'd like to see examples for this model, check out the following projects:
- CloudStack: https://cwiki.apache.org/confluence/display/CLOUDSTACK/CloudStack+Maintainers+Guide
- Subversion: https://subversion.apache.org/docs/community-guide/roles.html

Finally, I wanted to list our current proposal for initial components and maintainers. It
would be good to get feedback on other components we might add, but please note that personnel
discussions (e.g. "I don't think Matei should maintain *that* component") should only happen
on the private list. The initial components were chosen to include all public APIs and the
main core components, and the maintainers were chosen from the most active contributors to
those modules.

- Spark core public API: Matei, Patrick, Reynold
- Job scheduler: Matei, Kay, Patrick
- Shuffle and network: Reynold, Aaron, Matei
- Block manager: Reynold, Aaron
- YARN: Tom, Andrew Or
- Python: Josh, Matei
- MLlib: Xiangrui, Matei
- SQL: Michael, Reynold
- Streaming: TD, Matei
- GraphX: Ankur, Joey, Reynold

I'd like to formally call a [VOTE] on this model, to last 72 hours. The [VOTE] will end on
Nov 8, 2014 at 6 PM PST.

Matei