spark-dev mailing list archives

From Chris Fregly <ch...@fregly.com>
Subject Re: Switch RDD-based MLlib APIs to maintenance mode in Spark 2.0
Date Wed, 06 Apr 2016 03:02:50 GMT
Perhaps renaming to Spark ML would actually clear up the code and documentation confusion?

+1 for rename 

> On Apr 5, 2016, at 7:00 PM, Reynold Xin <rxin@databricks.com> wrote:
> 
> +1
> 
> This is a no brainer IMO.
> 
> 
>> On Tue, Apr 5, 2016 at 7:32 PM, Joseph Bradley <joseph@databricks.com> wrote:
>> +1  By the way, the JIRA for tracking (Scala) API parity is: https://issues.apache.org/jira/browse/SPARK-4591
>> 
>>> On Tue, Apr 5, 2016 at 4:58 PM, Matei Zaharia <matei.zaharia@gmail.com> wrote:
>>> This sounds good to me as well. The one thing we should pay attention to is how we update the docs so that people know to start with the spark.ml classes. Right now the docs list spark.mllib first and also seem more comprehensive in that area than in spark.ml, so maybe people naturally move towards that.
>>> 
>>> Matei
>>> 
>>>> On Apr 5, 2016, at 4:44 PM, Xiangrui Meng <meng@databricks.com> wrote:
>>>> 
>>>> Yes, DB (cc'ed) is working on porting the local linear algebra library over (SPARK-13944). There are also frequent pattern mining algorithms we need to port over in order to reach feature parity. -Xiangrui
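[For context on the local linear algebra port (SPARK-13944) mentioned above: the practical effect for users is that the same local Vector/Matrix types appear under a new namespace. A hedged sketch — the package paths shown are as they eventually landed in Spark 2.x, not quoted from this thread:]

```scala
// Illustrative only: the same dense vector, old vs. new namespace.
val oldVec = org.apache.spark.mllib.linalg.Vectors.dense(1.0, 2.0) // RDD-based package
val newVec = org.apache.spark.ml.linalg.Vectors.dense(1.0, 2.0)    // DataFrame-based package
```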
>>>> 
>>>>> On Tue, Apr 5, 2016 at 12:08 PM Shivaram Venkataraman <shivaram@eecs.berkeley.edu> wrote:
>>>>> Overall this sounds good to me. One question I have is that in
>>>>> addition to the ML algorithms we have a number of linear algebra
>>>>> (various distributed matrices) and statistical methods in the
>>>>> spark.mllib package. Is the plan to port or move these to the spark.ml
>>>>> namespace in the 2.x series?
>>>>> 
>>>>> Thanks
>>>>> Shivaram
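[As a concrete illustration of the spark.mllib distributed linear algebra Shivaram is asking about, here is a minimal sketch; the SparkContext `sc` and the toy data are hypothetical placeholders, not from the thread:]

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// A distributed matrix in the spark.mllib package: rows stored as an RDD of local vectors.
val rows = sc.parallelize(Seq(
  Vectors.dense(1.0, 2.0),
  Vectors.dense(3.0, 4.0)))
val mat = new RowMatrix(rows)

// One of the distributed methods that would need porting, e.g. SVD.
val svd = mat.computeSVD(2)
```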
>>>>> 
>>>>> On Tue, Apr 5, 2016 at 11:48 AM, Sean Owen <sowen@cloudera.com> wrote:
>>>>> > FWIW, all of that sounds like a good plan to me. Developing one API is certainly better than two.
>>>>> >
>>>>> > On Tue, Apr 5, 2016 at 7:01 PM, Xiangrui Meng <mengxr@gmail.com> wrote:
>>>>> >> Hi all,
>>>>> >>
>>>>> >> More than a year ago, in Spark 1.2 we introduced the ML pipeline API built on top of Spark SQL’s DataFrames. Since then the new DataFrame-based API has been developed under the spark.ml package, while the old RDD-based API has been developed in parallel under the spark.mllib package. While it was easier to implement and experiment with new APIs under a new package, it became harder and harder to maintain as both packages grew bigger and bigger. And new users are often confused by having two sets of APIs with overlapping functionality.
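[To illustrate the two API styles being contrasted above, a minimal sketch; the DataFrame `training`, the column names, and the choice of logistic regression are illustrative assumptions, not from the thread:]

```scala
// DataFrame-based spark.ml API: estimators composed into a Pipeline over a DataFrame.
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.StandardScaler

val scaler = new StandardScaler().setInputCol("features").setOutputCol("scaled")
val lr = new LogisticRegression().setFeaturesCol("scaled").setLabelCol("label")
val model = new Pipeline().setStages(Array(scaler, lr)).fit(training) // `training`: hypothetical DataFrame

// RDD-based spark.mllib API: free-standing training methods over RDDs of LabeledPoint.
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.regression.LabeledPoint

val rdd: org.apache.spark.rdd.RDD[LabeledPoint] = ??? // hypothetical input
val oldModel = new LogisticRegressionWithLBFGS().run(rdd)
```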
>>>>> >>
>>>>> >> We started to recommend the DataFrame-based API over the RDD-based API in Spark 1.5 for its versatility and flexibility, and we have seen development and usage gradually shift to the DataFrame-based API. Just counting lines of Scala code, from 1.5 to the current master we have added ~10000 lines to the DataFrame-based API but only ~700 to the RDD-based API. So, to focus more resources on the development of the DataFrame-based API and to help users migrate over sooner, I want to propose switching the RDD-based MLlib APIs to maintenance mode in Spark 2.0. What does that mean exactly?
>>>>> >>
>>>>> >> * We do not accept new features in the RDD-based spark.mllib package unless they block implementing new features in the DataFrame-based spark.ml package.
>>>>> >> * We still accept bug fixes in the RDD-based API.
>>>>> >> * We will add more features to the DataFrame-based API in the 2.x series to reach feature parity with the RDD-based API.
>>>>> >> * Once we reach feature parity (possibly in Spark 2.2), we will deprecate the RDD-based API.
>>>>> >> * We will remove the RDD-based API from the main Spark repo in Spark 3.0.
>>>>> >>
>>>>> >> Though the RDD-based API is already in de facto maintenance mode, this announcement will make that status explicit, which matters to both MLlib developers and users. So we’d greatly appreciate your feedback!
>>>>> >>
>>>>> >> (As a side note, people sometimes use “Spark ML” to refer to the DataFrame-based API or even the entire MLlib component. This also causes confusion. To be clear, “Spark ML” is not an official name, and there are no plans to rename MLlib to “Spark ML” at this time.)
>>>>> >>
>>>>> >> Best,
>>>>> >> Xiangrui
>>>>> >
>>>>> > ---------------------------------------------------------------------
>>>>> > To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
>>>>> > For additional commands, e-mail: user-help@spark.apache.org
>>>>> >
> 
