spark-issues mailing list archives

From "Joseph K. Bradley (JIRA)" <>
Subject [jira] [Commented] (SPARK-5995) Make ML Prediction Developer APIs public
Date Mon, 02 Mar 2015 19:39:05 GMT


Joseph K. Bradley commented on SPARK-5995:

Pinging all people who commented on []:  [~sparks]
[~shivaram] [~lewuathe] [~srowen] [~tomerk] [~prudenko] [~mengxr]

If you have further thoughts about what other changes would make it easier for developers to
write new algorithms in, please discuss here!  I'll mull this over for a while before
making a PR.  Currently, the main change I'm planning on is the comment in the description
above about "Developers implement more basic transformation methods, such as features2raw,
raw2pred, raw2prob."  But if there are other useful changes, please say so, even if they include
removing some of the abstractions or functionality introduced in my previous PR.
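
For illustration, the features2raw / raw2pred / raw2prob idea might look roughly like the
sketch below.  This is a minimal plain-Scala sketch under stated assumptions, not the actual
Spark ML API: the names ClassifierSketch, ProbabilisticClassifierSketch, Output, and
ToyLogisticModel, the wantRaw/wantPred flags, and the use of Array[Double] in place of
Vector and DataFrame columns are all hypothetical.

```scala
// Hypothetical sketch only: these classes are NOT the real Spark ML classes.
// Feature vectors are plain Array[Double] to keep the example self-contained.
object PredictionApiSketch {
  type Features = Array[Double]

  // One output row; columns that were not requested stay None.
  case class Output(raw: Option[Array[Double]], prediction: Option[Double])

  // Developers of a classifier implement only the basic methods.
  abstract class ClassifierSketch {
    def features2raw(features: Features): Array[Double]

    // Default: predict the class with the largest raw score.
    def raw2pred(raw: Array[Double]): Double = raw.indexOf(raw.max).toDouble

    // The abstract class owns transform(): it computes the raw scores once
    // and derives only the requested outputs from them.
    def transform(rows: Seq[Features], wantRaw: Boolean, wantPred: Boolean): Seq[Output] =
      rows.map { f =>
        val raw = features2raw(f)
        Output(
          if (wantRaw) Some(raw) else None,
          if (wantPred) Some(raw2pred(raw)) else None)
      }
  }

  // A probabilistic classifier adds a single extra method.
  abstract class ProbabilisticClassifierSketch extends ClassifierSketch {
    def raw2prob(raw: Array[Double]): Array[Double]
  }

  // Toy binary logistic model with made-up weights.
  class ToyLogisticModel(weights: Array[Double]) extends ProbabilisticClassifierSketch {
    def features2raw(features: Features): Array[Double] = {
      val margin = features.zip(weights).map { case (x, w) => x * w }.sum
      Array(-margin, margin)
    }

    def raw2prob(raw: Array[Double]): Array[Double] = {
      val p1 = 1.0 / (1.0 + math.exp(-raw(1)))
      Array(1.0 - p1, p1)
    }
  }
}
```

In this sketch the model author never writes transform(); the shared implementation decides
which intermediate results to keep, which is the "Pros" argument made for this option.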

Thanks in advance!

> Make ML Prediction Developer APIs public
> ----------------------------------------
>                 Key: SPARK-5995
>                 URL:
>             Project: Spark
>          Issue Type: Sub-task
>          Components: ML
>    Affects Versions: 1.3.0
>            Reporter: Joseph K. Bradley
>            Assignee: Joseph K. Bradley
> Previously, some Developer APIs were added to for classification and regression
to make it easier to add new algorithms and models: [SPARK-4789]  There are ongoing discussions
about the best design of the API.  This JIRA is to continue that discussion and try to finalize
those Developer APIs so that they can be made public.
> Please see [this design doc from SPARK-4789 |]
for details on the original API design.
> Some issues under debate:
> * Should there be strongly typed APIs for fit()?
> * Should the strongly typed API for transform() be public (vs. protected)?
> * What transformation methods should the API make developers implement for classification?
 (See details below.)
> * Should there be a way to transform a single Row (instead of only DataFrames)?
> More on "What transformation methods should the API make developers implement for classification?":
> * Goals:
> ** Optimize transform: Make it fast, and make it output only the desired columns.
> ** Easy development
> ** Support Classifier, Regressor, and ProbabilisticClassifier
> * (currently) Developers implement predictX methods for each output column X.  They may
override transform() to optimize speed.
> ** Pros: predictX is easy to understand.
> ** Cons: An optimized transform() is annoying to write.
> * Developers implement more basic transformation methods, such as features2raw, raw2pred,
raw2prob.
> ** Pros: Abstract classes may implement optimized transform().
> ** Cons: Different types of predictors require different methods:
> *** Predictor and Regressor: features2pred
> *** Classifier: features2raw, raw2pred
> *** ProbabilisticClassifier: raw2prob
> * Developers implement a single predict() method which takes parameters for what columns
to output (returning tuple or some type with None for missing values).  Abstract classes take
the outputs they want and put them into columns.
> ** Pros: Developers only write 1 method and can optimize it as much as they want.  It
could be more optimized than the previous 2 options; e.g., if LogisticRegressionModel only
wants the prediction, then it never has to construct intermediate results such as the vector
of raw predictions.
> ** Cons: predict() will have a different signature for different abstractions, based
on the possible output columns.
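
As a rough illustration of the last option (a single predict() method that is told which
outputs to produce), consider the sketch below.  It is a hypothetical plain-Scala example,
not Spark ML code: the Prediction case class, the wantRaw/wantProb/wantLabel flags, and
ToyLogisticModel are assumed names, and Array[Double] stands in for feature vectors.

```scala
// Hypothetical sketch only: names and signatures are illustrative, not Spark ML's.
object SinglePredictSketch {
  // Outputs that were not requested stay None.
  case class Prediction(
      raw: Option[Array[Double]] = None,
      probability: Option[Array[Double]] = None,
      label: Option[Double] = None)

  // Toy binary logistic model with made-up weights.
  class ToyLogisticModel(weights: Array[Double]) {
    def predict(
        features: Array[Double],
        wantRaw: Boolean = false,
        wantProb: Boolean = false,
        wantLabel: Boolean = true): Prediction = {
      val margin = features.zip(weights).map { case (x, w) => x * w }.sum
      Prediction(
        // The raw vector and the probabilities are built only on request.
        raw = if (wantRaw) Some(Array(-margin, margin)) else None,
        probability = if (wantProb) {
          val p1 = 1.0 / (1.0 + math.exp(-margin))
          Some(Array(1.0 - p1, p1))
        } else None,
        // The label needs only the sign of the margin.
        label = if (wantLabel) Some(if (margin > 0.0) 1.0 else 0.0) else None)
    }
  }
}
```

When only the label is requested, this toy model never allocates the raw-prediction vector,
which mirrors the LogisticRegressionModel example in the "Pros" above.  The "Cons" also shows
up: a regressor's predict() would need a different set of flags and a different result type.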
