spark-issues mailing list archives

From "yuhao yang (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (SPARK-19747) Consolidate code in ML aggregators
Date Tue, 28 Feb 2017 21:58:45 GMT

    [ https://issues.apache.org/jira/browse/SPARK-19747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15887421#comment-15887421 ]

yuhao yang edited comment on SPARK-19747 at 2/28/17 9:58 PM:
-------------------------------------------------------------

I did notice the code duplication while implementing LinearSVC. Glad to see you've already started
on this.

While the "loss" clearly can be extracted, we can also perhaps make it more generic and support
interchangeable penalty, learning_rate, or even optimizer.

* "penalty" (‘none’, ‘l2’, ‘l1’, or ‘elasticnet’),

*  "learning_rate"
**  ‘constant’: eta = eta0
**  ‘optimal’: eta = 1.0 / (alpha * (t + t0)) 
**  ‘invscaling’: eta = eta0 / pow(t, power_t)

* optimizer
** SGD
** LBFGS
** OWLQN etc.

Once the generic framework is developed, we can gradually migrate the existing implementations.
I was working on a generic SGDClassifier, but there are some tricky issues around feature
standardization, the intercept, and multi-class support.
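
To make this concrete, here is a rough Scala sketch of what interchangeable loss, penalty, and
learning-rate pieces could look like. All trait and object names below are hypothetical
illustrations, not an existing Spark API; the learning-rate formulas are exactly the three listed
above.

{code:scala}
// Hypothetical sketch only: none of these names exist in Spark ML today.

// Differentiable loss: (loss, d(loss)/d(margin)) for one labeled example.
trait Loss {
  def lossAndGradient(margin: Double, label: Double): (Double, Double)
}

// Hinge loss as used by LinearSVC (labels assumed to be +1.0 / -1.0 here).
object HingeLoss extends Loss {
  def lossAndGradient(margin: Double, label: Double): (Double, Double) = {
    val z = label * margin
    if (z < 1.0) (1.0 - z, -label) else (0.0, 0.0)
  }
}

// Penalty gradient for one coordinate ('none', 'l2', 'l1', 'elasticnet').
trait Penalty {
  def gradient(w: Double, alpha: Double): Double
}
object L2Penalty extends Penalty {
  def gradient(w: Double, alpha: Double): Double = alpha * w
}

// The three learning-rate schedules listed above.
sealed trait LearningRate { def eta(t: Long): Double }
case class Constant(eta0: Double) extends LearningRate {
  def eta(t: Long): Double = eta0
}
case class Optimal(alpha: Double, t0: Double) extends LearningRate {
  def eta(t: Long): Double = 1.0 / (alpha * (t + t0))
}
case class InvScaling(eta0: Double, powerT: Double) extends LearningRate {
  def eta(t: Long): Double = eta0 / math.pow(t.toDouble, powerT)
}
{code}

A generic SGD driver would then take one of each: at step t it computes eta(t) from the schedule
and applies the loss and penalty gradients, so adding a new algorithm means supplying only a new
Loss.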



> Consolidate code in ML aggregators
> ----------------------------------
>
>                 Key: SPARK-19747
>                 URL: https://issues.apache.org/jira/browse/SPARK-19747
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>    Affects Versions: 2.2.0
>            Reporter: Seth Hendrickson
>            Priority: Minor
>
> Many algorithms in Spark ML are posed as optimization of a differentiable loss function
> over a parameter vector. We implement these by having a loss function accumulate the gradient
> using an Aggregator class which has methods that amount to a {{seqOp}} and a {{combOp}}. So
> pretty much every algorithm that follows this form implements a cost function class and an
> aggregator class, which are completely separate from one another but probably share 80% of
> the same code.

> I think it is important to clean things like this up, and if we can do it properly it will
> make the code much more maintainable, readable, and bug-free. It will also help reduce the
> overhead of future implementations.
> The design is of course open for discussion, but I think we should aim to:
> 1. Have all aggregators share parent classes, so that they only need to implement the
> {{add}} function. This is really the only difference among the current aggregators.
> 2. Have a single, generic cost function that is parameterized by the aggregator type. This
> reduces the many places where we implement cost functions and greatly reduces the amount of
> duplicated code.
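
As a rough illustration of those two points (all names below are hypothetical; the actual design
is open for discussion, as the description says), a shared parent could own {{merge}} (the combOp)
while each algorithm implements only {{add}} (the seqOp), and a single cost function could run the
{{treeAggregate}} for every algorithm:

{code:scala}
import scala.reflect.ClassTag
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.rdd.RDD

// Sketch only: hypothetical names, not the committed design.
// Point 1: a shared parent where merge (the combOp) is written once,
// so each algorithm implements only add (the seqOp).
abstract class LossAggregator[Agg <: LossAggregator[Agg]] { self: Agg =>
  protected var weightSum: Double = 0.0
  protected var lossSum: Double = 0.0
  protected def gradientSumArray: Array[Double]

  /** The seqOp: fold one weighted instance into the running sums. */
  def add(label: Double, weight: Double, features: Vector): Agg

  /** The combOp: identical for every loss, so it lives in the parent. */
  def merge(other: Agg): Agg = {
    weightSum += other.weightSum
    lossSum += other.lossSum
    val g = gradientSumArray
    val og = other.gradientSumArray
    var i = 0
    while (i < g.length) { g(i) += og(i); i += 1 }
    this
  }

  def loss: Double = lossSum / weightSum
  def gradient: Array[Double] = gradientSumArray.map(_ / weightSum)
}

// Point 2: one generic cost function, parameterized by the aggregator
// type, shared by every algorithm instead of being reimplemented.
class RDDLossFunction[Agg <: LossAggregator[Agg]: ClassTag](
    instances: RDD[(Double, Double, Vector)],  // (label, weight, features)
    newAggregator: () => Agg) {

  def calculate(): (Double, Array[Double]) = {
    val agg = instances.treeAggregate(newAggregator())(
      seqOp = (a, x) => a.add(x._1, x._2, x._3),
      combOp = (a, b) => a.merge(b))
    (agg.loss, agg.gradient)
  }
}
{code}

With this shape, LogisticRegression, LinearRegression, LinearSVC, etc. would each contribute only
an aggregator subclass with its own {{add}}.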





