spark-issues mailing list archives

From "Nick Pentreath (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-19747) Consolidate code in ML aggregators
Date Mon, 27 Feb 2017 07:50:45 GMT

    [ https://issues.apache.org/jira/browse/SPARK-19747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15885298#comment-15885298 ]

Nick Pentreath commented on SPARK-19747:
----------------------------------------

Big +1 for this! I agree that each concrete implementation should only have to specify
the part of the aggregation that is unique to it, which is effectively the loss.

The general approach sounds good to me.

> Consolidate code in ML aggregators
> ----------------------------------
>
>                 Key: SPARK-19747
>                 URL: https://issues.apache.org/jira/browse/SPARK-19747
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>    Affects Versions: 2.2.0
>            Reporter: Seth Hendrickson
>            Priority: Minor
>
> Many algorithms in Spark ML are posed as optimization of a differentiable loss function
over a parameter vector. We implement these by having a loss function accumulate the gradient
using an Aggregator class which has methods that amount to a {{seqOp}} and {{combOp}}. So,
pretty much every algorithm that obeys this form implements a cost function class and an aggregator
class, which are completely separate from one another but share probably 80% of the same code.
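
To make the pattern above concrete, here is a minimal plain-Python sketch of one such aggregator. It is not Spark's code: the class name, the squared-error loss, and the local `aggregate` helper (which stands in for a distributed aggregate over partitions) are all assumptions chosen for illustration. `add` plays the role of the seqOp and `merge` the role of the combOp.

```python
from functools import reduce


class LeastSquaresAggregator:
    """Hypothetical aggregator: accumulates squared-error loss and its gradient."""

    def __init__(self, coefficients):
        self.coefficients = list(coefficients)
        self.loss_sum = 0.0
        self.gradient_sum = [0.0] * len(self.coefficients)
        self.count = 0

    def add(self, instance):
        """seqOp: fold one (label, features) instance into the running sums."""
        label, features = instance
        error = sum(w * x for w, x in zip(self.coefficients, features)) - label
        self.loss_sum += 0.5 * error * error
        for j, x in enumerate(features):
            self.gradient_sum[j] += error * x
        self.count += 1
        return self

    def merge(self, other):
        """combOp: combine the partial sums computed on another partition."""
        self.loss_sum += other.loss_sum
        self.gradient_sum = [a + b for a, b in zip(self.gradient_sum, other.gradient_sum)]
        self.count += other.count
        return self


def aggregate(partitions, coefficients):
    """Local stand-in for a distributed aggregate: seqOp per partition, then combOp."""
    partials = [
        reduce(lambda agg, inst: agg.add(inst), part, LeastSquaresAggregator(coefficients))
        for part in partitions
    ]
    return reduce(lambda a, b: a.merge(b), partials)
```

Only the body of `add` depends on the loss; the constructor, `merge`, and the aggregation driver would look the same for any other differentiable loss, which is the duplication the issue describes.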

> I think it is important to clean things like this up, and if we can do it properly it
will make the code much more maintainable, readable, and less bug-prone. It will also help reduce
the overhead of future implementations.
> The design is of course open for discussion, but I think we should aim to:
> 1. Have all aggregators share parent classes, so that they only need to implement the
{{add}} function. This is really the only difference in the current aggregators.
> 2. Have a single, generic cost function that is parameterized by the aggregator type.
This reduces the many places we implement cost functions and greatly reduces the amount of
duplicated code.
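
The two goals above could be sketched roughly as follows, in plain Python rather than Spark's own code. Every name here (`DifferentiableLossAggregator`, `CostFunction`, the two example losses) is hypothetical, and the in-memory list of partitions stands in for a distributed dataset. The parent class owns the shared state and `merge`, each concrete aggregator implements only `add`, and a single cost function takes the aggregator type as a parameter.

```python
import math
from abc import ABC, abstractmethod
from functools import reduce


class DifferentiableLossAggregator(ABC):
    """Hypothetical shared parent: everything except `add` lives here."""

    def __init__(self, coefficients):
        self.coefficients = list(coefficients)
        self.loss_sum = 0.0
        self.gradient_sum = [0.0] * len(self.coefficients)
        self.count = 0

    @abstractmethod
    def add(self, instance):
        """Fold one (label, features) instance into the sums; return self."""

    def merge(self, other):
        self.loss_sum += other.loss_sum
        self.gradient_sum = [a + b for a, b in zip(self.gradient_sum, other.gradient_sum)]
        self.count += other.count
        return self

    def _margin(self, features):
        return sum(w * x for w, x in zip(self.coefficients, features))

    def _update(self, loss, multiplier, features):
        # Shared bookkeeping: the per-instance gradient is multiplier * features.
        self.loss_sum += loss
        for j, x in enumerate(features):
            self.gradient_sum[j] += multiplier * x
        self.count += 1
        return self

    def loss(self):
        return self.loss_sum / self.count  # assumes at least one instance was added

    def gradient(self):
        return [g / self.count for g in self.gradient_sum]


class LeastSquaresAggregator(DifferentiableLossAggregator):
    def add(self, instance):
        label, features = instance
        error = self._margin(features) - label
        return self._update(0.5 * error * error, error, features)


class LogisticAggregator(DifferentiableLossAggregator):
    def add(self, instance):
        label, features = instance  # label is assumed to be 0.0 or 1.0
        m = self._margin(features)
        # Numerically stable log(1 + exp(m)) - label * m.
        loss = max(m, 0.0) + math.log1p(math.exp(-abs(m))) - label * m
        p = 1.0 / (1.0 + math.exp(-m)) if m >= 0 else math.exp(m) / (1.0 + math.exp(m))
        return self._update(loss, p - label, features)


class CostFunction:
    """Single generic cost function, parameterized by the aggregator type."""

    def __init__(self, partitions, aggregator_type):
        self.partitions = partitions
        self.aggregator_type = aggregator_type

    def calculate(self, coefficients):
        partials = [
            reduce(lambda agg, inst: agg.add(inst), part, self.aggregator_type(coefficients))
            for part in self.partitions
        ]
        total = reduce(lambda a, b: a.merge(b), partials)
        return total.loss(), total.gradient()
```

Under these assumptions, switching the loss is a one-argument change, for example `CostFunction(partitions, LogisticAggregator).calculate([0.0, 0.0])`, and adding a new algorithm means writing one `add` method.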



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


