[ https://issues.apache.org/jira/browse/SPARK-1503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14350812#comment14350812
]
Aaron Staple edited comment on SPARK-1503 at 3/6/15 11:25 PM:

I recently created a PR for an implementation of accelerated gradient descent without backtracking,
as discussed above.
I also ran some simple, small scale (single server) benchmarks to compare different optimization
methods. (The benchmark result graphs are provided as images attached to this ticket.) The
optimization methods tested were:
 gra: existing gradient descent implementation (using full batch)
 acc: accelerated descent (as implemented in the PR), but without automatic restart
 acc_r: accelerated descent, with automatic restart
 acc_b: accelerated descent, with backtracking, without automatic restart
 acc_rb: accelerated descent, with backtracking, with automatic restart
 lbfgs: existing lbfgs implementation
(Note that backtracking is not part of the PR, and was tested only as a point of comparison.)
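For reference, the accelerated update with the function-value automatic restart heuristic can be sketched as follows. This is a minimal single-machine sketch; all names are illustrative and are not the PR’s API, and the gradient is assumed to be computed over the full batch.

```scala
object AcceleratedDescent {
  type Vec = Array[Double]

  // Returns a * x + y, elementwise.
  private def axpy(a: Double, x: Vec, y: Vec): Vec =
    Array.tabulate(x.length)(i => a * x(i) + y(i))

  // Accelerated gradient descent with a fixed step size (no backtracking)
  // and an optional function-value restart: if the objective increases,
  // momentum is reset and descent restarts from the current iterate.
  def minimize(f: Vec => Double, grad: Vec => Vec, x0: Vec,
               step: Double, iters: Int, restart: Boolean): Vec = {
    var x = x0              // current iterate
    var y = x0              // extrapolated (momentum) point
    var t = 1.0             // Nesterov momentum parameter
    var fPrev = Double.MaxValue
    for (_ <- 1 to iters) {
      val xNext = axpy(-step, grad(y), y)   // gradient step from y
      val fCur = f(xNext)
      if (restart && fCur > fPrev) {
        t = 1.0             // automatic restart: drop accumulated momentum
        y = xNext
      } else {
        val tNext = (1.0 + math.sqrt(1.0 + 4.0 * t * t)) / 2.0
        val diff = Array.tabulate(x.length)(i => xNext(i) - x(i))
        y = axpy((t - 1.0) / tNext, diff, xNext)  // extrapolation
        t = tNext
      }
      x = xNext
      fPrev = fCur
    }
    x
  }
}
```

In a distributed setting each `grad` call is one Spark job, which is why the outer-loop iteration count used in the benchmarks tracks job count for the non-backtracking variants.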
The x-axis shows the number of outer-loop iterations of the optimization algorithm. (Note
that for the backtracking implementations, the full cost of backtracking is not reflected in
this outer-loop count. For the non-backtracking implementations, the number of outer-loop
iterations equals the number of Spark map-reduce jobs.) The y-axis is the log of the difference
between the current objective value and the best optimized value determined across all methods.
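Concretely, the plotted quantity is just the log of the gap to the best value found (a sketch; the name is illustrative):

```scala
object ConvergenceMetric {
  // Log of the gap between an iterate's objective value and the best
  // objective value determined across all methods.
  def logSuboptimality(fIter: Double, fBest: Double): Double =
    math.log(fIter - fBest)
}
```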
The optimization test runs were:
 linear: A scaled-up version of the test data from TFOCS’s test_LASSO.m example, with 10000
observations on 1024 features, 512 of which are actually correlated with the result. Unregularized
linear regression was used. (The Scala acceleration implementation was observed to be consistent
with the TFOCS implementation on this dataset.)
 linear l1: The same as ‘linear’, but with l1 regularization
 logistic: Each feature of each observation is generated by summing a feature Gaussian, specific
to the observation’s binary category, with a noise Gaussian. 10000 observations on 250 features.
Unregularized logistic regression was used.
 logistic l2: Same as ‘logistic’, but using l2 regularization
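The synthetic logistic data described above can be sketched as follows. Dimensions and the seed are illustrative, and this is not the exact generator used for the benchmarks.

```scala
import scala.util.Random

object LogisticData {
  // Each feature value is a category-specific Gaussian draw plus
  // independent Gaussian noise.
  def generate(n: Int, d: Int, seed: Long = 42L): Seq[(Int, Array[Double])] = {
    val rng = new Random(seed)
    // One mean vector per binary category.
    val means = Array.fill(2)(Array.fill(d)(rng.nextGaussian()))
    Seq.fill(n) {
      val label = rng.nextInt(2)
      val features = Array.tabulate(d)(j => means(label)(j) + rng.nextGaussian())
      (label, features)
    }
  }
}
```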
Note that for each test run, all optimization methods were given the same initial step size.
Observations:
 Acceleration consistently converges more quickly than standard gradient descent, given the
same initial step size.
 Automatic restart improves the convergence of the accelerated method.
 Backtracking can significantly boost convergence rates in some cases (measured in terms
of outer loop iterations). But the full cost of backtracking was not measured in these runs.
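For context, the backtracking variant tested above corresponds to a standard Armijo-style line search, sketched below. Each trial step requires an extra objective evaluation, which is why its full cost is not captured by the outer-loop iteration count. Names are illustrative, not the benchmark code.

```scala
object Backtracking {
  type Vec = Array[Double]

  // Shrinks the step until the Armijo sufficient-decrease condition holds:
  //   f(x - step * g) <= f(x) - c * step * ||g||^2
  // Each loop iteration costs one objective evaluation (one extra pass
  // over the data in a distributed setting).
  def backtrack(f: Vec => Double, gx: Vec, x: Vec, step0: Double,
                shrink: Double = 0.5, c: Double = 1e-4): Double = {
    val fx = f(x)
    val gNormSq = gx.map(g => g * g).sum
    var step = step0
    while (f(Array.tabulate(x.length)(i => x(i) - step * gx(i))) >
           fx - c * step * gNormSq) {
      step *= shrink
    }
    step
  }
}
```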
 lbfgs generally outperformed accelerated gradient descent in these test runs. Accelerated
gradient descent was competitive with lbfgs for linear l1 (lasso) regression. However, as noted
in the documentation, the L1Updater “will not work” with the lbfgs implementation. It seems
l1 regularization is currently a weak spot for the lbfgs implementation.
> Implement Nesterov's accelerated first-order method
> 
>
> Key: SPARK-1503
> URL: https://issues.apache.org/jira/browse/SPARK-1503
> Project: Spark
> Issue Type: New Feature
> Components: MLlib
> Reporter: Xiangrui Meng
> Assignee: Aaron Staple
> Attachments: linear.png, linear_l1.png, logistic.png, logistic_l2.png
>
>
> Nesterov's accelerated first-order method is a drop-in replacement for steepest descent,
but it converges much faster. We should implement this method and compare its performance
with existing algorithms, including SGD and L-BFGS.
> TFOCS (http://cvxr.com/tfocs/) is a reference implementation of Nesterov's method and
its variants on composite objectives.

This message was sent by Atlassian JIRA
(v6.3.4#6332)

To unsubscribe, email: issues-unsubscribe@spark.apache.org
For additional commands, email: issues-help@spark.apache.org
