spark-dev mailing list archives

From Felix Cheung <felixcheun...@hotmail.com>
Subject Re: Feedback on MLlib roadmap process proposal
Date Thu, 19 Jan 2017 21:51:12 GMT
Hi Seth

Re: "The most important thing we can do, given that MLlib currently has a very limited committer
review bandwidth, is to make clear issues that, if worked on, will definitely get reviewed."

We are adopting a Shepherd model, as described in Joseph's JIRA: once assigned, the Shepherd
will see the work through with the contributor to make sure it lands in the target
release.

I'm sure Joseph can explain it better than I can ;)


_____________________________
From: Mingjie Tang <tangrock@gmail.com>
Sent: Thursday, January 19, 2017 10:30 AM
Subject: Re: Feedback on MLlib roadmap process proposal
To: Seth Hendrickson <seth.hendrickson16@gmail.com>
Cc: Joseph Bradley <joseph@databricks.com>, <dev@spark.apache.org>


+1 general abstractions like distributed linear algebra.

On Thu, Jan 19, 2017 at 8:54 AM, Seth Hendrickson <seth.hendrickson16@gmail.com>
wrote:
I think the proposal laid out in SPARK-18813 is well done, and I do think it is going to improve
the process going forward. I also really like the idea of getting the community to vote on
JIRAs to give some of them priority - provided that we listen to those votes, of course. The
biggest problem I see is that we do have several active contributors and those who want to
help implement these changes, but PRs are reviewed rather sporadically and I imagine it is
very difficult for contributors to understand why some get reviewed and some do not. The most
important thing we can do, given that MLlib currently has a very limited committer review
bandwidth, is to make clear issues that, if worked on, will definitely get reviewed. A hard
thing to do in open source, no doubt, but even if we have to limit the scope of such issues
to a very small subset, it's a gain for all I think.

On a related note, I would love to hear some discussion on the higher level goal of Spark
MLlib (if this derails the original discussion, please let me know and we can discuss in another
thread). The roadmap does contain specific items that help to convey some of this (ML parity
with MLlib, model persistence, etc.), but I'm interested in what the "mission" of Spark
MLlib is. We often see PRs for brand new algorithms which are sometimes rejected and sometimes
not. Do we aim to keep implementing more and more algorithms? Or is our focus really, now
that we have a reasonable library of algorithms, to simply make the existing ones faster/better/more
robust? Should we aim to make interfaces that developers can easily extend to implement
their own custom code (e.g. custom optimization libraries), or do we want to restrict
things to out-of-the-box algorithms? Should we focus on more flexible, general abstractions
like distributed linear algebra?

I was not involved in the project in the early days of MLlib when this discussion may have
happened, but I think it would be useful to either revisit it or restate it here for some
of the newer developers.

On Tue, Jan 17, 2017 at 3:38 PM, Joseph Bradley <joseph@databricks.com>
wrote:
Hi all,

This is a general call for thoughts about the process for the MLlib roadmap proposed in SPARK-18813.
See the section called "Roadmap process."

Summary:
* This process is about committers indicating intention to shepherd and review.
* The goal is to improve visibility and communication.
* This is fairly orthogonal to the SIP discussion since this proposal is more about setting
release targets than about proposing future plans.

Thanks!
Joseph

--

Joseph Bradley

Software Engineer - Machine Learning

Databricks, Inc.

http://databricks.com




