spark-issues mailing list archives

From "Hossein Falaki (JIRA)" <>
Subject [jira] [Commented] (SPARK-24359) SPIP: ML Pipelines in R
Date Tue, 05 Jun 2018 03:52:00 GMT


Hossein Falaki commented on SPARK-24359:

[~shivaram] what prevents us from creating a tag like SparkML- and SparkML-
(or some other variant) in the main Spark repo? Also, if you think this will be unclear
initially, we don't have to submit SparkML to CRAN in its first release. As with SparkR,
we can wait a bit until we are confident about its compatibility. Many users and distributions
distribute SparkR from Apache rather than CRAN; one example is Databricks, where we build
SparkR from source.

> SPIP: ML Pipelines in R
> -----------------------
>                 Key: SPARK-24359
>                 URL:
>             Project: Spark
>          Issue Type: Improvement
>          Components: SparkR
>    Affects Versions: 3.0.0
>            Reporter: Hossein Falaki
>            Priority: Major
>              Labels: SPIP
>         Attachments: SparkML_ ML Pipelines in R-v2.pdf, SparkML_ ML Pipelines in R-v3.pdf,
SparkML_ ML Pipelines in R.pdf
> h1. Background and motivation
> SparkR supports calling MLlib functionality with an [R-friendly API|].
Since Spark 1.5, the (new) SparkML API, which is based on [pipelines and parameters|], has
matured significantly. It allows users to build and maintain complicated machine learning pipelines.
A lot of this functionality is difficult to expose through the simple formula-based API in SparkR.
> We propose a new R package, _SparkML_, to be distributed along with SparkR as part of
Apache Spark. This new package will be built on top of SparkR’s APIs to expose SparkML’s
pipeline APIs and functionality.
> *Why not SparkR?*
> The SparkR package contains ~300 functions, many of which shadow functions in base R and other
popular CRAN packages. We think adding more functions to SparkR will degrade usability and
make maintenance harder.
> *Why not sparklyr?*
> sparklyr is an R package developed by RStudio Inc. to expose the Spark API to R users. sparklyr
includes MLlib API wrappers, but to the best of our knowledge they are not comprehensive.
Also, we propose a code-generation approach for this package to minimize the work needed to expose
future MLlib APIs, whereas sparklyr's API is written by hand.
> h1. Target Personas
>  * Existing SparkR users who need a more flexible SparkML API
>  * R users (data scientists, statisticians) who wish to build Spark ML pipelines in R
> h1. Goals
>  * R users can install SparkML from CRAN
>  * R users will be able to import SparkML independently of SparkR
>  * After setting up a Spark session R users can
>  ** create a pipeline by chaining individual components and specifying their parameters
>  ** tune a pipeline in parallel, taking advantage of Spark
>  ** inspect a pipeline’s parameters and evaluation metrics
>  ** repeatedly apply a pipeline
>  * MLlib contributors can easily add R wrappers for new MLlib Estimators and Transformers
> h1. Non-Goals
>  * Adding new algorithms to the SparkML R package that do not exist in Scala
>  * Parallelizing existing CRAN packages
>  * Changing existing SparkR ML wrapping API
> h1. Proposed API Changes
> h2. Design goals
> When encountering trade-offs in the API, we will choose based on the following list of priorities;
the API choice that addresses the higher-priority goal will be chosen.
>  # *Comprehensive coverage of MLlib API:* Design choices that make R coverage of future
ML algorithms difficult will be ruled out.
>  # *Semantic clarity:* We attempt to minimize confusion with other packages. Between
conciseness and clarity, we will choose clarity.
>  # *Maintainability and testability:* API choices that require manual maintenance or
make testing difficult should be avoided.
>  # *Interoperability with the rest of Spark:* We will keep the R API as thin as
possible and keep all functionality implemented in JVM/Scala.
>  # *Being natural to R users:* The ultimate users of this package are R users, and they should
find it easy and natural to use.
> The API will follow familiar R function syntax, where the object is passed as the first
argument of the method: do_something(obj, arg1, arg2). All functions are snake_case (e.g.,
{{spark_logistic_regression()}} and {{set_max_iter()}}). If a constructor takes arguments,
they will be named arguments. For example:
> {code:java}
> > lr <- set_reg_param(set_max_iter(spark_logistic_regression(), 10), 0.1){code}
> When calls need to be chained, as in the example above, the syntax translates nicely to a natural
pipeline style with help from the very popular [magrittr package|].
For example:
> {code:java}
> > spark_logistic_regression() %>% set_max_iter(10) %>% set_reg_param(0.1) -> lr{code}
> h2. Namespace
> All new APIs will live in a new CRAN package named SparkML. The package should be usable
without needing SparkR in the namespace. The package will introduce a number of S4 classes
that inherit from four basic classes. Here we list the basic types with a few examples.
An object of any child class can be instantiated with a function call that starts with {{spark_}}.
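> As a purely illustrative sketch (the slot name {{jobj}} and the idea of wrapping a JVM object handle are assumptions for illustration, not part of the proposal), the S4 hierarchy could be declared along these lines:
> {code:java}
> # Sketch only: base S4 classes for the proposed hierarchy, each
> # holding a reference to the corresponding JVM object.
> setClass("PipelineStage", representation(jobj = "ANY"))
> setClass("Transformer", contains = "PipelineStage")
> setClass("Estimator", contains = "PipelineStage")
>
> # A constructor following the proposed spark_ naming convention
> # (the jobj would come from the backend in a real implementation):
> spark_tokenizer <- function() new("Transformer", jobj = NULL){code}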
> h2. Pipeline & PipelineStage
> A pipeline object contains one or more stages.  
> {code:java}
> > pipeline <- spark_pipeline() %>% set_stages(stage1, stage2, stage3){code}
> where stage1, stage2, etc. are S4 objects of type PipelineStage, and pipeline is an object
of type Pipeline.
> h2. Transformers
> A Transformer is an algorithm that can transform one SparkDataFrame into another SparkDataFrame.
> *Example API:*
> {code:java}
> > tokenizer <- spark_tokenizer() %>%
>             set_input_col("text") %>%
>             set_output_col("words")
> > tokenized.df <- tokenizer %>% transform(df) {code}
> h2. Estimators
> An Estimator is an algorithm which can be fit on a SparkDataFrame to produce a Transformer.
E.g., a learning algorithm is an Estimator which trains on a DataFrame and produces a model.
> *Example API:*
> {code:java}
> lr <- spark_logistic_regression() %>%
>             set_max_iter(10) %>%
>             set_reg_param(0.001){code}
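> Under this proposal, fitting such an Estimator would plausibly follow the same pipeline style; this is a sketch only, and the {{training}} and {{test}} SparkDataFrames below are assumed:
> {code:java}
> > model <- lr %>% fit(training)
> > predictions <- model %>% transform(test){code}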
> h2. Evaluators
> An evaluator computes metrics from predictions (model outputs) and returns a scalar metric.
> *Example API:*
> {code:java}
> lr.eval <- spark_regression_evaluator(){code}
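> A hedged usage sketch (the {{set_metric_name()}} setter and {{evaluate()}} function mirror MLlib's RegressionEvaluator, but their exact R spellings here are assumptions, and {{predictions}} is an assumed SparkDataFrame of model output):
> {code:java}
> > rmse <- lr.eval %>% set_metric_name("rmse") %>% evaluate(predictions){code}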
> h2. Miscellaneous Classes
> MLlib pipelines include a variety of miscellaneous classes that serve as helpers and utilities.
For example, an object of class ParamGridBuilder is used to build a parameter grid for grid search.
Another example is ClusteringSummary.
> *Example API:*
> {code:java}
> > grid <- param_grid_builder() %>%
>             add_grid(reg_param(lr), c(0.1, 0.01)) %>%
>             add_grid(fit_intercept(lr), c(TRUE, FALSE)) %>%
>             add_grid(elastic_net_param(lr), c(0.0, 0.5, 1.0))
> > model <- train_validation_split() %>%
>             set_estimator(lr) %>%
>             set_evaluator(spark_regression_evaluator()) %>%
>             set_estimator_param_maps(grid) %>%
>             set_train_ratio(0.8) %>%
>             set_parallelism(2) %>%
>             fit(training){code}
> h2. Pipeline Persistence
> The SparkML package will fix a longstanding issue with SparkR model persistence (SPARK-15572):
SparkML will directly wrap MLlib's pipeline persistence API.
> *API example:*
> {code:java}
> > model <- pipeline %>% fit(training)
> > model %>% spark_write_pipeline(overwrite = TRUE, path = "..."){code}
> h1. Design Sketch
> We propose using code generation from Scala to produce comprehensive API wrappers in
R. For more details please see the attached design document.
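> For illustration only, a generated wrapper could reduce to a thin call into the JVM; the {{invoke()}} helper below is a hypothetical stand-in for SparkR's backend call mechanism, not an API this proposal defines:
> {code:java}
> # Hypothetical generator output for a single setter: delegate to the
> # corresponding Scala method and return the object for chaining.
> set_max_iter <- function(obj, value) {
>   invoke(obj@jobj, "setMaxIter", as.integer(value))
>   obj
> }{code}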

This message was sent by Atlassian JIRA
