spark-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Xiangrui Meng <m...@databricks.com>
Subject Re: Switch RDD-based MLlib APIs to maintenance mode in Spark 2.0
Date Tue, 05 Apr 2016 20:44:35 GMT
Yes, DB (cc'ed) is working on porting the local linear algebra library over
(SPARK-13944). There are also frequent pattern mining algorithms we need to
port over in order to reach feature parity. -Xiangrui

On Tue, Apr 5, 2016 at 12:08 PM Shivaram Venkataraman <
shivaram@eecs.berkeley.edu> wrote:

> Overall this sounds good to me. One question I have is that in
> addition to the ML algorithms we have a number of linear algebra
> (various distributed matrices) and statistical methods in the
> spark.mllib package. Is the plan to port or move these to the spark.ml
> namespace in the 2.x series ?
>
> Thanks
> Shivaram
>
> On Tue, Apr 5, 2016 at 11:48 AM, Sean Owen <sowen@cloudera.com> wrote:
> > FWIW, all of that sounds like a good plan to me. Developing one API is
> > certainly better than two.
> >
> > On Tue, Apr 5, 2016 at 7:01 PM, Xiangrui Meng <mengxr@gmail.com> wrote:
> >> Hi all,
> >>
> >> More than a year ago, in Spark 1.2 we introduced the ML pipeline API
> built
> >> on top of Spark SQL’s DataFrames. Since then the new DataFrame-based
> API has
> >> been developed under the spark.ml package, while the old RDD-based API
> has
> >> been developed in parallel under the spark.mllib package. While it was
> >> easier to implement and experiment with new APIs under a new package, it
> >> became harder and harder to maintain as both packages grew bigger and
> >> bigger. And new users are often confused by having two sets of APIs with
> >> overlapped functions.
> >>
> >> We started to recommend the DataFrame-based API over the RDD-based API
> in
> >> Spark 1.5 for its versatility and flexibility, and we saw the
> development
> >> and the usage gradually shifting to the DataFrame-based API. Just
> counting
> >> the lines of Scala code, from 1.5 to the current master we added ~10000
> >> lines to the DataFrame-based API while ~700 to the RDD-based API. So, to
> >> gather more resources on the development of the DataFrame-based API and
> to
> >> help users migrate over sooner, I want to propose switching RDD-based
> MLlib
> >> APIs to maintenance mode in Spark 2.0. What does it mean exactly?
> >>
> >> * We do not accept new features in the RDD-based spark.mllib package,
> unless
> >> they block implementing new features in the DataFrame-based spark.ml
> >> package.
> >> * We still accept bug fixes in the RDD-based API.
> >> * We will add more features to the DataFrame-based API in the 2.x
> series to
> >> reach feature parity with the RDD-based API.
> >> * Once we reach feature parity (possibly in Spark 2.2), we will
> deprecate
> >> the RDD-based API.
> >> * We will remove the RDD-based API from the main Spark repo in Spark
> 3.0.
> >>
> >> Though the RDD-based API is already in de facto maintenance mode, this
> >> announcement will make it clear and hence important to both MLlib
> developers
> >> and users. So we’d greatly appreciate your feedback!
> >>
> >> (As a side note, people sometimes use “Spark ML” to refer to the
> >> DataFrame-based API or even the entire MLlib component. This also causes
> >> confusion. To be clear, “Spark ML” is not an official name and there
> are no
> >> plans to rename MLlib to “Spark ML” at this time.)
> >>
> >> Best,
> >> Xiangrui
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
> > For additional commands, e-mail: user-help@spark.apache.org
> >
>

Mime
View raw message