spark-dev mailing list archives

From Tim Hunter <timhun...@databricks.com>
Subject Re: Feedback on MLlib roadmap process proposal
Date Thu, 23 Feb 2017 19:38:50 GMT
As Sean wrote very nicely above, the changes made to Spark are decided in
an organic fashion based on the interests and motivations of the committers
and contributors. The case of deep learning is a good example. There is a
lot of interest, and the core algorithms could be implemented without too
much trouble in a few thousand lines of Scala code. However, such a simple
implementation would be one to two orders of magnitude slower than what you
would get from the popular frameworks out there.

At this point, there are probably more man-hours invested in TensorFlow (as
an example) than in MLlib, so I think we need to be realistic about what we
can expect to achieve inside Spark. Unlike BLAS for linear algebra, there
is no agreed-upon interface for deep learning, and each of the XOnSpark
flavors explores a slightly different design. It will be interesting to see
what works well in practice. In the meantime, though, there are plenty of
things that we could do to help developers of other libraries to have a
great experience with Spark. Matei alluded to that in his Spark Summit
keynote when he mentioned better integration with low-level libraries.

Tim


On Thu, Feb 23, 2017 at 5:32 AM, Nick Pentreath <nick.pentreath@gmail.com>
wrote:

> Sorry for being late to the discussion. I think Joseph, Sean and others
> have covered the issues well.
>
> Overall I like the proposed cleaned up roadmap & process (thanks Joseph!).
> As for the actual critical roadmap items mentioned on SPARK-18813, I think
> it makes sense and will comment a bit further on that JIRA.
>
> I would like to encourage votes & watching for issues to give a sense of
> what the community wants (I guess Vote is more explicit yet passive, while
> actually Watching an issue is more informative as it may indicate a real
> use case dependent on the issue?!).
>
> I think if used well this is valuable information for contributors. Of
> course not everything on that list can get done. But if I look through the
> top votes or watch list, while not all of those are likely to go in, a
> great many of the issues are fairly non-contentious in terms of being good
> additions to the project.
>
> Things like these are good examples IMO (I just sample a few of them, not
> exhaustive):
> - sample weights for RF / DT
> - multi-model and/or parallel model selection
> - make sharedParams public?
> - multi-column support for various transformers
> - incremental model training
> - tree algorithm enhancements
>
> Now, whether these can be prioritised in terms of bandwidth available to
> reviewers and committers is a totally different thing. But as Sean mentions
> there is some process there for trying to find the balance of the issue
> being a "good thing to add", a shepherd with bandwidth & interest in the
> issue to review, and the maintenance burden imposed.
>
> Let's take Deep Learning / NN for example. Here's a good example of
> something that has a lot of votes/watchers and as Sean mentions it is
> something that "everyone wants someone else to implement". In this case,
> much of the interest may in fact be "stale" - 2 years ago it would have
> been very interesting to have a strong DL impl in Spark. Now, because there
> are a plethora of very good DL libraries out there, how many of those Votes
> would be "deleted"? Granted, few are well integrated with Spark, but that can
> change and is changing (DL4J, BigDL, the "XonSpark" flavours etc).
>
> So this is something that I dare say will not be in Spark any time in the
> foreseeable future or perhaps ever given the current status. Perhaps it's
> worth seriously thinking about just closing these kinds of issues?
>
>
>
> On Fri, 27 Jan 2017 at 05:53 Joseph Bradley <joseph@databricks.com> wrote:
>
>> Sean has given a great explanation.  A few more comments:
>>
>> Roadmap: I have been creating roadmap JIRAs, but the goal really is to
>> have all committers working on MLlib help to set that roadmap, based on
>> either their knowledge of current maintenance/internal needs of the project
>> or the feedback given from the rest of the community.
>> @Committers - I see people actively shepherding PRs for MLlib, but I
>> don't see many major initiatives linked to the roadmap.  If there are ones
>> large enough to merit adding to the roadmap, please do.
>>
>> In general, there are many process improvements we could make.  A few in
>> my mind are:
>> * Visibility: Let the community know what committers are focusing on.
>> This was the primary purpose of the "MLlib roadmap proposal."
>> * Community initiatives: This is currently very organic.  Some of the
>> organic process could be improved, such as encouraging Votes/Watchers
>> (though I agree with Sean about these being one-sided metrics).  Cody's SIP
>> work is a great step towards adding more clarity and structure for major
>> initiatives.
>> * JIRA hygiene: Always a challenge, and always requires some manual
>> prodding.  But it's great to push for efforts on this.
>>
>>
>> On Wed, Jan 25, 2017 at 3:59 AM, Sean Owen <sowen@cloudera.com> wrote:
>>
>> On Wed, Jan 25, 2017 at 6:01 AM Ilya Matiach <ilmat@microsoft.com> wrote:
>>
>> My confusion was that the ML 2.2 roadmap critical features (
>> https://issues.apache.org/jira/browse/SPARK-18813) did not line up with
>> the top ML/MLLIB JIRAs by Votes
>> <https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(ML%2C%20MLlib)%20ORDER%20BY%20votes%20DESC> or
>> Watchers
>> <https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(ML%2C%20MLlib)%20ORDER%20BY%20Watchers%20DESC>
>> .
>>
>> Your explanation that they do not have to, and that there is a more complex
>> process for choosing the changes that will make it into the next release,
>> makes sense to me.
>>
>>
>> For Spark ML, Joseph is the de facto leader and does publish a tentative
>> roadmap. (We could also use JIRA mechanisms for this but any scheme is
>> better than none.) Yes, not based on Votes -- nothing here is. Votes are a
>> noisy signal because they usually measure: what would you like done if
>> you didn't have to do it and there were no downsides for you?
>>
>>
>>
>> My only humble recommendation would be to clean up the top JIRAs by
>> closing the ones which have spark packages for them (eg the NN one which
>> already has several packages as you explained), noting or somehow marking
>> on some that they will not be resolved, and changing the component on the
>> ones not related to ML/MLLIB (eg
>> https://issues.apache.org/jira/browse/SPARK-12965).
>>
>>
>> We do that. It occasionally generates protests, so, I find myself erring
>> on the side of ignoring. You can comment on any JIRA you think should be
>> closed. That's helpful.
>>
>> That particular JIRA seems potentially legitimate. I wouldn't close it.
>> It also won't get fixed until someone proposes a resolution. I'd strongly
>> encourage people saying "I have this problem too" to try to fix it. I tend
>> to ignore these otherwise, myself, in favor of reviewing ones where someone
>> has gone to the trouble of proposing a working fix.
>>
>>
>>
>> Also, I would love to do this if I had the permissions, but it would be
>> great to change the JIRAs that are marked as “in progress” but where the
>> corresponding pull request was closed/cancelled, for example
>> https://issues.apache.org/jira/browse/SPARK-4638.  That JIRA is
>>
>>
>> Yes, flag these. I or others can close them if appropriate. Anyone who
>> consistently does this well, we could give JIRA permissions to.
>>
>> Opening a PR automatically makes it "In Progress" but there's no
>> complementary process to un-mark it. You can ignore the Open / In Progress
>> distinction really.
>>
>> This one is interesting because it does seem like a plausible feature to
>> add. The original PR was abandoned by the author and nobody else submitted
>> one -- despite the Votes. I hesitate to signal that no PRs would be
>> considered, but it doesn't seem like it's in demand enough for someone to
>> work on?
>>
>>
>> I think one of my messages is that, de facto, here, like in many Apache
>> projects, committers do not take requests. They pursue the work they
>> believe needs doing, and shepherd work initiated by others (a clear bug
>> report, a PR) to a resolution. Things get done by doing them, or by
>> building influence by doing other things the project needs doing. It isn't
>> a mechanical, objective process, and can't be. But it does work in a
>> recognizable way.
>>
>>
>>
>>
>> --
>>
>> Joseph Bradley
>>
>> Software Engineer - Machine Learning
>>
>> Databricks, Inc.
>>
>>
>
