flink-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-1807) Stochastic gradient descent optimizer for ML library
Date Thu, 07 May 2015 09:19:02 GMT

    [ https://issues.apache.org/jira/browse/FLINK-1807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14532306#comment-14532306

ASF GitHub Bot commented on FLINK-1807:

Github user tillrohrmann commented on a diff in the pull request:

    --- Diff: docs/libs/ml/optimization.md ---
    @@ -0,0 +1,222 @@
    +mathjax: include
    +title: "ML - Optimization"
    +displayTitle: <a href="index.md">ML</a> - Optimization
    +Licensed to the Apache Software Foundation (ASF) under one
    +or more contributor license agreements.  See the NOTICE file
    +distributed with this work for additional information
    +regarding copyright ownership.  The ASF licenses this file
    +to you under the Apache License, Version 2.0 (the
    +"License"); you may not use this file except in compliance
    +with the License.  You may obtain a copy of the License at
    +  http://www.apache.org/licenses/LICENSE-2.0
    +Unless required by applicable law or agreed to in writing,
    +software distributed under the License is distributed on an
    +KIND, either express or implied.  See the License for the
    +specific language governing permissions and limitations
    +under the License.
    +* Table of contents
    +## Mathematical Formulation
    +The optimization framework in Flink is a developer-oriented package that can be used
to solve
    +problems common in Machine Learning (ML) tasks. In the supervised learning context, this
    +involves finding a model, as defined by a set of parameters $w$, that minimize a function
    +given a set of $(\x, y)$ examples,
    +where $\x$ is a feature vector and $y$ is a real number, which can represent either a
real value in 
    +the regression case, or a class label in the classification case. In supervised learning,
    +function to be minimized is usually of the form:
    +    f(\wv) := 
    +    \frac1n \sum_{i=1}^n L(\wv;\x_i,y_i) +
    +    \lambda\, R(\wv)
    +    \label{eq:objectiveFunc}
    +    \ .
    +where $L$ is the loss function and $R(\wv)$ the regularization penalty. We use $L$ to
measure how
    +well the model fits the observed data, and we use $R$ in order to impose a complexity
cost to the
    +model, with $\lambda > 0$ being the regularization parameter.
    +### Loss Functions
    +In supervised learning, we use loss functions in order to measure the model fit, by 
    +penalizing errors in the predictions $p$ made by the model compared to the true $y$ for
    +example. Different loss function can be used for regression (e.g. Squared Loss) and classification
    +(e.g. Hinge Loss).
    +Some common loss functions are:
    +* Squared Loss: $ \frac{1}{2} (\wv^T \x - y)^2, \quad y \in \R $ 
    +* Hinge Loss: $ \max (0, 1 - y ~ \wv^T \x), \quad y \in \{-1, +1\} $
    +* Logistic Loss: $ \log(1+\exp( -y ~ \wv^T \x)), \quad y \in \{-1, +1\} $
    +Currently, only the Squared Loss function is implemented in Flink.
    +### Regularization Types
    +[Regularization](https://en.wikipedia.org/wiki/Regularization_(mathematics)) in machine
learning is 
    +imposes penalties to the estimated models, in order to reduce overfitting. The most common
    +are the $L_1$ and $L_2$ penalties, defined as:
    +* $L_1$: $R(\wv) = \|\wv\|_1$
    +* $L_2$: $R(\wv) = \frac{1}{2}\|\wv\|_2^2$
    +The $L_2$ penalty penalizes large weights, favoring solutions with more small weights
rather than
    +few large ones.
    +The $L_1$ penalty can be used to drive a number of the solution coefficients to 0, thereby
    +producing sparse solutions.
    +The optimization framework in Flink supports the $L_1$ and $L_2$ penalties, as well as
    +regularization. The 
    +regularization parameter $\lambda$ in $\eqref{objectiveFunc}$ determines the amount of

    +regularization applied to the model,
    +and is usually determined through model cross-validation.
    +## Stochastic Gradient Descent
    +In order to find a (local) minimum of a function, Gradient Descent methods take steps
in the
    +direction opposite to the gradient of the function $\eqref{objectiveFunc}$ taken with
    +respect to the current parameters (weights).
    +In order to compute the exact gradient we need to perform one pass through all the points
    +a dataset, making the process computationally expensive.
    +An alternative is Stochastic Gradient Descent (SGD) where at each iteration we sample
one point
    +from the complete dataset and update the parameters for each point, in an online manner.
    +In mini-batch SGD we instead sample random subsets of the dataset, and compute the gradient
    +over each batch. At each iteration of the algorithm we update the weights once, based
    +the average of the gradients computed from each mini-batch.
    +An important parameter is the learning rate $\eta$, or step size, which is currently
determined as
    +$\eta = \frac{\eta_0}{\sqrt{j}}$, where $\eta_0$ is the initial step size and $j$ is
the iteration 
    +number. The setting of the initial step size can significantly affect the performance
of the 
    +algorithm. For some practical tips on tuning SGD see Leon Botou's 
    +"[Stochastic Gradient Descent Tricks](http://research.microsoft.com/pubs/192769/tricks-2012.pdf)".
    +The current implementation of SGD  uses the whole partition, making it 
    +effectively a batch gradient descent. Once a sampling operator has been introduced in
Flink, true
    +mini-batch SGD will be performed.
    +### Parameters
    +  The stochastic gradient descent implementation can be controlled by the following parameters:
    +   <table class="table table-bordered">
    +    <thead>
    +      <tr>
    +        <th class="text-left" style="width: 20%">Parameter</th>
    +        <th class="text-center">Description</th>
    +      </tr>
    +    </thead>
    +    <tbody>
    +      <tr>
    +        <td><strong>Loss Function</strong></td>
    +        <td>
    +          <p>
    +            The class of the loss function to be used. (Default value: 
    +            <strong>SquaredLoss</strong>, used for regression tasks)
    +          </p>
    +        </td>
    +      </tr>
    +      <tr>
    +        <td><strong>RegularizationType</strong></td>
    +        <td>
    +          <p>
    +            The type of regularization penalty to apply. (Default value: 
    +            <strong>NoRegularization</strong>)
    +          </p>
    +        </td>
    +      </tr>
    +      <tr>
    +        <td><strong>RegularizationParameter</strong></td>
    +        <td>
    +          <p>
    +            The amount of regularization to apply. (Default value:<strong>0</strong>)
    +          </p>
    +        </td>
    +      </tr>     
    +      <tr>
    +        <td><strong>Iterations</strong></td>
    +        <td>
    +          <p>
    +            The maximum number of iterations. (Default value: <strong>10</strong>)
    +          </p>
    +        </td>
    +      </tr>
    +      <tr>
    +        <td><strong>Stepsize</strong></td>
    +        <td>
    +          <p>
    +            Initial step size for the gradient descent method.
    +            This value controls how far the gradient descent method moves in the opposite
direction of the gradient.
    +            (Default value: <strong>0.1</strong>)
    +          </p>
    +        </td>
    +      </tr>
    +    </tbody>
    +  </table>
    +### Examples
    +In the Flink implementation of SGD, given a set of examples in a `DataSet[LabeledVector]`
    +optionally some initial weights, we can use `GradientDescent.optimize()` in order to
    +the weights for the given data.
    +The user can provide an initial `DataSet[WeightVector]`,
    +which contains one `WeightVector` element, or use the default weights which are all set
to 0.
    +A `WeightVector` is a container class for the weights, which separates the intercept
from the
    +weight vector. This allows us to avoid applying regularization to the intercept.
    +{% highlight scala %}
    +// Create stochastic gradient descent solver
    +val sgd = GradientDescent()
    +.setLossFunction(new SquaredLoss)
    +.setRegularizationType(new L1Regularization)
    +// Obtain data
    +val trainingDS: DataSet[LabeledVector] = ...
    +// Fit the solver to the provided data, using initial weights set to 0.0
    +val weightDS = sgd.optimize(inputDS, None)
    +// Retrieve the optimized weights
    +val weightVector = weightDS.collect().head
    --- End diff --
    Usually, we would want to keep the weight vector in a data set to be broadcasted in the
prediction step.

> Stochastic gradient descent optimizer for ML library
> ----------------------------------------------------
>                 Key: FLINK-1807
>                 URL: https://issues.apache.org/jira/browse/FLINK-1807
>             Project: Flink
>          Issue Type: Improvement
>          Components: Machine Learning Library
>            Reporter: Till Rohrmann
>            Assignee: Theodore Vasiloudis
>              Labels: ML
> Stochastic gradient descent (SGD) is a widely used optimization technique in different
ML algorithms. Thus, it would be helpful to provide a generalized SGD implementation which
can be instantiated with the respective gradient computation. Such a building block would
make the development of future algorithms easier.

This message was sent by Atlassian JIRA

View raw message