spark-issues mailing list archives

From "Kotaro Tanahashi (JIRA)" <>
Subject [jira] [Commented] (SPARK-10324) MLlib 1.6 Roadmap
Date Thu, 01 Oct 2015 07:41:57 GMT


Kotaro Tanahashi commented on SPARK-10324:

I would like to add "Item Based Collaborative filtering" to the recommendation and implement
The computational cost of the existing collaborative filtering algorithm (ALS) increases as the
number of users increases. However, the amount of computation for "Item Based Collaborative
filtering" increases with the number of items, so it works quickly and accurately.

Item-based collaborative filtering is described here.
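As a rough illustration of the idea in this comment, the following is a minimal single-machine sketch in plain Python, not Spark code and not the implementation proposed here. It assumes cosine similarity between items and a similarity-weighted average for prediction; the function names `item_similarities` and `predict` and the toy `ratings` data are hypothetical.

```python
from collections import defaultdict
from math import sqrt


def item_similarities(ratings):
    """Cosine similarity between every pair of items rated by a common user.

    ratings: dict mapping user -> dict mapping item -> rating.
    """
    norms = defaultdict(float)  # item -> sum of squared ratings
    dots = defaultdict(float)   # (item_a, item_b) -> dot product over shared users
    for user_ratings in ratings.values():
        items = sorted(user_ratings)
        for i, a in enumerate(items):
            norms[a] += user_ratings[a] ** 2
            for b in items[i + 1:]:
                dots[(a, b)] += user_ratings[a] * user_ratings[b]

    sims = defaultdict(dict)
    for (a, b), dot in dots.items():
        denom = sqrt(norms[a] * norms[b])
        if denom == 0.0:
            continue  # skip items whose ratings are all zero
        sims[a][b] = dot / denom
        sims[b][a] = dot / denom
    return sims


def predict(ratings, sims, user, item):
    """Similarity-weighted average of the user's ratings on items similar to `item`.

    Returns None when the user has rated nothing similar to `item`.
    """
    num = 0.0
    den = 0.0
    for other, rating in ratings[user].items():
        s = sims.get(item, {}).get(other, 0.0)
        num += s * rating
        den += abs(s)
    return num / den if den else None


# Hypothetical toy data: three users, three items.
ratings = {
    "u1": {"A": 5.0, "B": 3.0},
    "u2": {"A": 4.0, "B": 2.0, "C": 1.0},
    "u3": {"B": 4.0, "C": 5.0},
}
sims = item_similarities(ratings)
print(predict(ratings, sims, "u1", "C"))
```

In this sketch the similarity table holds at most one entry per item pair, so its size depends on the number of items rather than the number of users, which is the property the comment relies on.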

> MLlib 1.6 Roadmap
> -----------------
>                 Key: SPARK-10324
>                 URL:
>             Project: Spark
>          Issue Type: Umbrella
>          Components: ML, MLlib
>            Reporter: Xiangrui Meng
>            Assignee: Xiangrui Meng
>            Priority: Blocker
> Following SPARK-8445, we created this master list for MLlib features we plan to have
in Spark 1.6. Please view this list as a wish list rather than a concrete plan, because we
don't have an accurate estimate of available resources. Due to limited review bandwidth, features
appearing on this list will get higher priority during code review. But feel free to suggest
new items to the list in comments. We are experimenting with this process. Your feedback would
be greatly appreciated.
> h1. Instructions
> h2. For contributors:
> * Please read
carefully. Code style, documentation, and unit tests are important.
> * If you are a first-time Spark contributor, please always start with a [starter task|]
rather than a medium/big feature. Based on our experience, mixing the development process
with a big feature usually causes a long delay in code review.
> * Never work silently. Let everyone know on the corresponding JIRA page when you start
working on some features. This is to avoid duplicate work. For small features, you don't need
to wait to get JIRA assigned.
> * For medium/big features or features with dependencies, please get assigned first before
coding and keep the ETA updated on the JIRA. If there is no activity on the JIRA page for
a certain amount of time, the JIRA should be released for other contributors.
> * Do not claim multiple (>3) JIRAs at the same time. Try to finish them one after another.
> * Remember to add `@Since("1.6.0")` annotation to new public APIs.
> * Please review others' PRs. Code review greatly
helps improve others' code as well as yours.
> h2. For committers:
> * Try to break down big features into small and specific JIRA tasks and link them properly.
> * Add "starter" label to starter tasks.
> * Put a rough estimate for medium/big features and track the progress.
> * If you start reviewing a PR, please add yourself to the Shepherd field on JIRA.
> * If the code looks good to you, please comment "LGTM". For non-trivial PRs, please ping
a maintainer to make a final pass.
> * After merging a PR, create and link JIRAs for Python, example code, and documentation
if necessary.
> h1. Roadmap (WIP)
> This is NOT [a complete list of MLlib JIRAs for 1.6|].
We only include umbrella JIRAs and high-level tasks.
> h2. Algorithms and performance
> * log-linear model for survival analysis (SPARK-8518)
> * normal equation approach for linear regression (SPARK-9834)
> * iteratively re-weighted least squares (IRLS) for GLMs (SPARK-9835)
> * robust linear regression with Huber loss (SPARK-3181)
> * vector-free L-BFGS (SPARK-10078)
> * tree partition by features (SPARK-3717)
> * bisecting k-means (SPARK-6517)
> * weighted instance support (SPARK-9610)
> ** logistic regression (SPARK-7685)
> ** linear regression (SPARK-9642)
> ** random forest (SPARK-9478)
> * locality sensitive hashing (LSH) (SPARK-5992)
> * deep learning (SPARK-5575)
> ** autoencoder (SPARK-10408)
> ** restricted Boltzmann machine (RBM) (SPARK-4251)
> ** convolutional neural network (stretch)
> * factorization machine (SPARK-7008)
> * local linear algebra (SPARK-6442)
> * distributed LU decomposition (SPARK-8514)
> h2. Statistics
> * univariate statistics as UDAFs (SPARK-10384)
> * bivariate statistics as UDAFs (SPARK-10385)
> * R-like statistics for GLMs (SPARK-9835)
> * online hypothesis testing (SPARK-3147)
> h2. Pipeline API
> * pipeline persistence (SPARK-6725)
> * ML attribute API improvements (SPARK-8515)
> * feature transformers (SPARK-9930)
> ** feature interaction (SPARK-9698)
> ** SQL transformer (SPARK-8345)
> ** ??
> * predict single instance (SPARK-10413)
> * test Kaggle datasets (SPARK-9941)
> h2. Model persistence
> * PMML export
> ** naive Bayes (SPARK-8546)
> ** decision tree (SPARK-8542)
> * model save/load
> ** FPGrowth (SPARK-6724)
> ** PrefixSpan (SPARK-10386)
> * code generation
> ** decision tree and tree ensembles (SPARK-10387)
> h2. Data sources
> * LIBSVM data source (SPARK-10117)
> * public dataset loader (SPARK-10388)
> h2. Python API for ML
> The main goal of the Python API is to have feature parity with the Scala/Java API. You can find
a complete list [here|]. The tasks fall
into two major categories:
> * Python API for new algorithms
> * Python API for missing methods (Some listed in [SPARK-10022] and [SPARK-9663])
> h2. SparkR API for ML
> * support more families and link functions in SparkR::glm (SPARK-9838, SPARK-9839, SPARK-9840)
> * better R formula support (SPARK-9681)
> * model summary with R-like statistics for GLMs (SPARK-9836, SPARK-9837)
> h2. Documentation
> * re-organize user guide (SPARK-8517)
> * @Since versions in, pyspark.mllib, and (SPARK-7751)
> * automatically test example code in user guide (SPARK-10382)

This message was sent by Atlassian JIRA
