spark-issues mailing list archives

From "Xiangrui Meng (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (SPARK-12626) MLlib 2.0 Roadmap
Date Tue, 26 Jan 2016 20:57:39 GMT

     [ https://issues.apache.org/jira/browse/SPARK-12626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiangrui Meng updated SPARK-12626:
----------------------------------
    Description: 
This is a master list for MLlib improvements we plan to have in Spark 2.0. Please view this
list as a wish list rather than a concrete plan, because we don't have an accurate estimate
of available resources. Due to limited review bandwidth, features appearing on this list will
get higher priority during code review. Feel free to suggest new items in the comments. We
are experimenting with this process. Your feedback would be greatly appreciated.

h1. Instructions

h2. For contributors:

* Please read https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark carefully.
Code style, documentation, and unit tests are important.
* If you are a first-time Spark contributor, please always start with a [starter task|https://issues.apache.org/jira/issues/?filter=12333209]
rather than a medium/big feature. In our experience, learning the development process while
tackling a big feature usually causes long delays in code review.
* Never work silently. Let everyone know on the corresponding JIRA page when you start working
on a feature; this avoids duplicate work. For small features, you don't need to wait for the
JIRA to be assigned.
* For medium/big features or features with dependencies, please get the JIRA assigned to you
before coding and keep the ETA updated on it. If there is no activity on the JIRA page for a
certain amount of time, the JIRA should be released to other contributors.
* Do not claim more than three JIRAs at the same time. Try to finish them one after another.
* Remember to add the `@Since("2.0.0")` annotation to new public APIs (see the sketch after
this list).
* Please review others' PRs (https://spark-prs.appspot.com/#mllib). Code review greatly helps
to improve others' code as well as yours.
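
As a concrete reference, here is a minimal sketch of the annotation in use inside Spark's
own source tree; the class and method names are hypothetical:

{code:scala}
import org.apache.spark.annotation.Since

/** Hypothetical new public API introduced in 2.0.0. */
@Since("2.0.0")
class ShinyNewTransformer {

  /** Members added in later releases get their own version tag. */
  @Since("2.0.0")
  def transformAll(): Unit = {
    // ...
  }
}
{code}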

h2. For committers:

* Try to break down big features into small and specific JIRA tasks and link them properly.
* Add a "starter" label to starter tasks.
* Put a rough estimate for medium/big features and track the progress.
* If you start reviewing a PR, please add yourself to the Shepherd field on JIRA.
* If the code looks good to you, please comment "LGTM". For non-trivial PRs, please ping a
maintainer to make a final pass.
* After merging a PR, create and link JIRAs for Python, example code, and documentation if
applicable.

h1. Roadmap (*WIP*)

This is NOT [a complete list of MLlib JIRAs for 2.0|https://issues.apache.org/jira/issues/?filter=12334385].
We only include umbrella JIRAs and high-level tasks.

Major efforts in this release:
* `spark.ml`: Achieve feature parity for the `spark.ml` API, relative to the `spark.mllib`
API.  This includes the Python API.
* Linear algebra: Separate out the linear algebra library as a standalone project without
a Spark dependency to simplify production deployment.
* Pipelines API: Complete critical improvements to the Pipelines API.
* New features: As usual, we expect to expand the feature set of MLlib.  However, we will
prioritize API parity over new features.  _New algorithms should be written for `spark.ml`,
not `spark.mllib`._

h2. Algorithms and performance

* iteratively re-weighted least squares (IRLS) for GLMs (SPARK-9835); see the sketch after this list
* estimator interface for GLMs (SPARK-12811)
* extended support for GLM model families and link functions in SparkR (SPARK-12566)
* improved model summaries and stats via IRLS (SPARK-9837)
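
As background for the IRLS items above, here is a minimal single-machine sketch of IRLS for
logistic regression using Breeze. The distributed version in SPARK-9835 would replace the
dense normal equations with aggregated statistics; all names below are illustrative:

{code:scala}
import breeze.linalg.{diag, DenseMatrix, DenseVector}
import breeze.numerics.sigmoid

// Each IRLS iteration solves the weighted least-squares problem
// beta = (X' W X)^-1 X' W z, with W = diag(mu (1 - mu)) and
// working response z = eta + (y - mu) / w.
def irlsLogistic(x: DenseMatrix[Double], y: DenseVector[Double],
                 iters: Int = 25): DenseVector[Double] = {
  var beta = DenseVector.zeros[Double](x.cols)
  for (_ <- 0 until iters) {
    val eta = x * beta
    val mu = sigmoid(eta)
    val w = DenseVector.tabulate(x.rows)(i => math.max(mu(i) * (1.0 - mu(i)), 1e-12))
    val z = DenseVector.tabulate(x.rows)(i => eta(i) + (y(i) - mu(i)) / w(i))
    val xtw = x.t * diag(w)        // p x n
    beta = (xtw * x) \ (xtw * z)   // solve the normal equations
  }
  beta
}
{code}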

Additional (maybe lower priority):
* robust linear regression with Huber loss (SPARK-3181)
* vector-free L-BFGS (SPARK-10078)
* tree partition by features (SPARK-3717)
* local linear algebra (SPARK-6442)
* weighted instance support (SPARK-9610)
** random forest (SPARK-9478)
** GBT (SPARK-9612)
* locality sensitive hashing (LSH) (SPARK-5992); see the sketch after this list
* deep learning (SPARK-5575)
** autoencoder (SPARK-10408)
** restricted Boltzmann machine (RBM) (SPARK-4251)
** convolutional neural network (stretch)
* factorization machine (SPARK-7008)
* distributed LU decomposition (SPARK-8514)
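
To make the LSH item concrete, here is a tiny self-contained sketch of random-hyperplane
(sign) hashing for cosine similarity. A SPARK-5992 design would presumably wrap something
like this in Estimator/Transformer form; every name here is illustrative:

{code:scala}
import scala.util.Random

// Random-hyperplane LSH: each signature bit is the sign of the dot product
// with a random Gaussian vector, so similar vectors agree on most bits.
class SignRandomProjectionLSH(dim: Int, numBits: Int, seed: Long = 42L) {
  private val rng = new Random(seed)
  private val planes: Array[Array[Double]] = Array.fill(numBits, dim)(rng.nextGaussian())

  def signature(v: Array[Double]): Array[Boolean] =
    planes.map { p =>
      var dot = 0.0
      var i = 0
      while (i < dim) { dot += p(i) * v(i); i += 1 }
      dot >= 0.0
    }

  /** Fraction of agreeing bits; estimates 1 - angle(u, v) / pi. */
  def estimatedSimilarity(a: Array[Boolean], b: Array[Boolean]): Double =
    a.zip(b).count { case (x, y) => x == y }.toDouble / numBits
}
{code}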

h2. Statistics

* bivariate statistics as UDAFs (SPARK-10385); see the sketch after this list
* R-like statistics for GLMs (SPARK-9835)
* sketch algorithms (cross-listed): approximate quantiles (SPARK-6761), count-min sketch
(SPARK-6763), Bloom filter (SPARK-12818)
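
For the first item, here is a sketch of what a bivariate statistic can look like as a
DataFrame UDAF, using sample covariance as the example. The class is hypothetical but
written against the existing UserDefinedAggregateFunction API:

{code:scala}
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

// Sample covariance over two double columns, kept as running sums so that
// partial aggregates merge cheaply across partitions.
class SampleCovariance extends UserDefinedAggregateFunction {
  def inputSchema: StructType = new StructType().add("x", DoubleType).add("y", DoubleType)
  def bufferSchema: StructType = new StructType()
    .add("n", LongType).add("sx", DoubleType).add("sy", DoubleType).add("sxy", DoubleType)
  def dataType: DataType = DoubleType
  def deterministic: Boolean = true

  def initialize(buf: MutableAggregationBuffer): Unit = {
    buf(0) = 0L; buf(1) = 0.0; buf(2) = 0.0; buf(3) = 0.0
  }

  def update(buf: MutableAggregationBuffer, in: Row): Unit =
    if (!in.isNullAt(0) && !in.isNullAt(1)) {
      buf(0) = buf.getLong(0) + 1L
      buf(1) = buf.getDouble(1) + in.getDouble(0)
      buf(2) = buf.getDouble(2) + in.getDouble(1)
      buf(3) = buf.getDouble(3) + in.getDouble(0) * in.getDouble(1)
    }

  def merge(b1: MutableAggregationBuffer, b2: Row): Unit = {
    b1(0) = b1.getLong(0) + b2.getLong(0)
    b1(1) = b1.getDouble(1) + b2.getDouble(1)
    b1(2) = b1.getDouble(2) + b2.getDouble(2)
    b1(3) = b1.getDouble(3) + b2.getDouble(3)
  }

  // cov(x, y) = (sum(xy) - sum(x) * sum(y) / n) / (n - 1)
  def evaluate(buf: Row): Any = {
    val n = buf.getLong(0).toDouble
    if (n < 2) null
    else (buf.getDouble(3) - buf.getDouble(1) * buf.getDouble(2) / n) / (n - 1)
  }
}

// Usage: df.agg(new SampleCovariance()(col("x"), col("y")))
{code}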

h2. Pipeline API

* pipeline persistence (SPARK-6725); target usage sketched below
** trees (SPARK-11888)
** RFormula (SPARK-11891)
** MLC (SPARK-11871)
** PySpark (SPARK-11939)
* ML attribute API improvements (SPARK-8515)
* predict single instance (SPARK-10413)
* test Kaggle datasets (SPARK-9941)

_There may be other design improvement efforts for Pipelines, to be listed here soon.  See
SPARK-5874 for a list of possibilities._
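
For pipeline persistence, the target usage is roughly the following. This is a sketch of
where SPARK-6725 is headed rather than a finished API; `training` is assumed to be a
DataFrame with "text" and "label" columns:

{code:scala}
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

// Fit a small text-classification pipeline, save it, and load it back.
val pipeline = new Pipeline().setStages(Array(
  new Tokenizer().setInputCol("text").setOutputCol("words"),
  new HashingTF().setInputCol("words").setOutputCol("features"),
  new LogisticRegression().setMaxIter(10)))

val model = pipeline.fit(training)
model.write.overwrite().save("/tmp/text-pipeline")
val restored = PipelineModel.load("/tmp/text-pipeline")
{code}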

h2. Model persistence

* PMML export
** naive Bayes (SPARK-8546)
** decision tree (SPARK-8542)
* model save/load (pattern sketched below)
** FPGrowth (SPARK-6724)
** PrefixSpan (SPARK-10386)
* code generation
** decision tree and tree ensembles (SPARK-10387)
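
The save/load items follow the existing spark.mllib Saveable/Loader convention. Here is a
hypothetical sketch of SPARK-6724, assuming FPGrowthModel gains the same pattern (the save
and load calls below do not exist yet); `transactions` is an RDD[Array[String]] of item
baskets and `sc` is the SparkContext:

{code:scala}
import org.apache.spark.mllib.fpm.{FPGrowth, FPGrowthModel}

val model: FPGrowthModel[String] =
  new FPGrowth().setMinSupport(0.2).setNumPartitions(4).run(transactions)

// Hypothetical, mirroring the Saveable/Loader pattern of other mllib models:
model.save(sc, "/tmp/fpgrowth-model")
val sameModel = FPGrowthModel.load(sc, "/tmp/fpgrowth-model")
{code}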

h2. Data sources

* public dataset loader (SPARK-10388)

h2. Python API for ML

The main goal of the Python API is feature parity with the Scala/Java API. You can find a
complete list [here|https://issues.apache.org/jira/issues/?filter=12333214]. The tasks fall
into the following major categories:

* Pipeline persistence in PySpark (SPARK-11939)
* Python API for missing methods (SPARK-11937)
* Python API for new algorithms. Committers should create a JIRA for the Python API after
merging a public feature in Scala/Java.

h2. SparkR API for ML

* support more families and link functions in SparkR::glm (SPARK-12566)
* model summary with R-like statistics for GLMs (SPARK-9837)
* support more algorithms (k-means (SPARK-13011), survival analysis (SPARK-13010), etc.)

h2. Documentation

* re-organize user guide (SPARK-8517)
* make example code testable in user guide (SPARK-11337)
* @Since versions in spark.ml, pyspark.mllib, and pyspark.ml (SPARK-7751)
* fix param format in pydoc (SPARK-11219)

  was: (the previous description, identical to the text above except that the SparkR "support
more algorithms" item listed K-Means without the SPARK-13011 reference)


> MLlib 2.0 Roadmap
> -----------------
>
>                 Key: SPARK-12626
>                 URL: https://issues.apache.org/jira/browse/SPARK-12626
>             Project: Spark
>          Issue Type: Umbrella
>          Components: ML, MLlib
>            Reporter: Joseph K. Bradley
>            Assignee: Xiangrui Meng
>            Priority: Blocker
>              Labels: roadmap
>


