spark-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Nick Pentreath <nick.pentre...@gmail.com>
Subject Re: Switch RDD-based MLlib APIs to maintenance mode in Spark 2.0
Date Wed, 06 Apr 2016 04:12:58 GMT
+1 for this proposal - as you mention I think it's the defacto current
situation anyway.

Note that from a developer view it's just the user-facing API that will be
only "ml" - the majority of the actual algorithms still operate on RDDs
under the good currently.
On Wed, 6 Apr 2016 at 05:03, Chris Fregly <chris@fregly.com> wrote:

> perhaps renaming to Spark ML would actually clear up code and
> documentation confusion?
>
> +1 for rename
>
> On Apr 5, 2016, at 7:00 PM, Reynold Xin <rxin@databricks.com> wrote:
>
> +1
>
> This is a no brainer IMO.
>
>
> On Tue, Apr 5, 2016 at 7:32 PM, Joseph Bradley <joseph@databricks.com>
> wrote:
>
>> +1  By the way, the JIRA for tracking (Scala) API parity is:
>> https://issues.apache.org/jira/browse/SPARK-4591
>>
>> On Tue, Apr 5, 2016 at 4:58 PM, Matei Zaharia <matei.zaharia@gmail.com>
>> wrote:
>>
>>> This sounds good to me as well. The one thing we should pay attention to
>>> is how we update the docs so that people know to start with the spark.ml
>>> classes. Right now the docs list spark.mllib first and also seem more
>>> comprehensive in that area than in spark.ml, so maybe people naturally
>>> move towards that.
>>>
>>> Matei
>>>
>>> On Apr 5, 2016, at 4:44 PM, Xiangrui Meng <meng@databricks.com> wrote:
>>>
>>> Yes, DB (cc'ed) is working on porting the local linear algebra library
>>> over (SPARK-13944). There are also frequent pattern mining algorithms we
>>> need to port over in order to reach feature parity. -Xiangrui
>>>
>>> On Tue, Apr 5, 2016 at 12:08 PM Shivaram Venkataraman <
>>> shivaram@eecs.berkeley.edu> wrote:
>>>
>>>> Overall this sounds good to me. One question I have is that in
>>>> addition to the ML algorithms we have a number of linear algebra
>>>> (various distributed matrices) and statistical methods in the
>>>> spark.mllib package. Is the plan to port or move these to the spark.ml
>>>> namespace in the 2.x series ?
>>>>
>>>> Thanks
>>>> Shivaram
>>>>
>>>> On Tue, Apr 5, 2016 at 11:48 AM, Sean Owen <sowen@cloudera.com> wrote:
>>>> > FWIW, all of that sounds like a good plan to me. Developing one API
is
>>>> > certainly better than two.
>>>> >
>>>> > On Tue, Apr 5, 2016 at 7:01 PM, Xiangrui Meng <mengxr@gmail.com>
>>>> wrote:
>>>> >> Hi all,
>>>> >>
>>>> >> More than a year ago, in Spark 1.2 we introduced the ML pipeline
API
>>>> built
>>>> >> on top of Spark SQL’s DataFrames. Since then the new DataFrame-based
>>>> API has
>>>> >> been developed under the spark.ml package, while the old RDD-based
>>>> API has
>>>> >> been developed in parallel under the spark.mllib package. While
it
>>>> was
>>>> >> easier to implement and experiment with new APIs under a new
>>>> package, it
>>>> >> became harder and harder to maintain as both packages grew bigger
and
>>>> >> bigger. And new users are often confused by having two sets of APIs
>>>> with
>>>> >> overlapped functions.
>>>> >>
>>>> >> We started to recommend the DataFrame-based API over the RDD-based
>>>> API in
>>>> >> Spark 1.5 for its versatility and flexibility, and we saw the
>>>> development
>>>> >> and the usage gradually shifting to the DataFrame-based API. Just
>>>> counting
>>>> >> the lines of Scala code, from 1.5 to the current master we added
>>>> ~10000
>>>> >> lines to the DataFrame-based API while ~700 to the RDD-based API.
>>>> So, to
>>>> >> gather more resources on the development of the DataFrame-based
API
>>>> and to
>>>> >> help users migrate over sooner, I want to propose switching
>>>> RDD-based MLlib
>>>> >> APIs to maintenance mode in Spark 2.0. What does it mean exactly?
>>>> >>
>>>> >> * We do not accept new features in the RDD-based spark.mllib
>>>> package, unless
>>>> >> they block implementing new features in the DataFrame-based spark.ml
>>>> >> package.
>>>> >> * We still accept bug fixes in the RDD-based API.
>>>> >> * We will add more features to the DataFrame-based API in the 2.x
>>>> series to
>>>> >> reach feature parity with the RDD-based API.
>>>> >> * Once we reach feature parity (possibly in Spark 2.2), we will
>>>> deprecate
>>>> >> the RDD-based API.
>>>> >> * We will remove the RDD-based API from the main Spark repo in Spark
>>>> 3.0.
>>>> >>
>>>> >> Though the RDD-based API is already in de facto maintenance mode,
>>>> this
>>>> >> announcement will make it clear and hence important to both MLlib
>>>> developers
>>>> >> and users. So we’d greatly appreciate your feedback!
>>>> >>
>>>> >> (As a side note, people sometimes use “Spark ML” to refer to
the
>>>> >> DataFrame-based API or even the entire MLlib component. This also
>>>> causes
>>>> >> confusion. To be clear, “Spark ML” is not an official name and
there
>>>> are no
>>>> >> plans to rename MLlib to “Spark ML” at this time.)
>>>> >>
>>>> >> Best,
>>>> >> Xiangrui
>>>> >
>>>> > ---------------------------------------------------------------------
>>>> > To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
>>>> > For additional commands, e-mail: user-help@spark.apache.org
>>>> >
>>>>
>>>
>>>
>>
>

Mime
View raw message