spark-user mailing list archives

From Ameet Talwalkar <am...@eecs.berkeley.edu>
Subject Re: what paper is the L2 regularization based on?
Date Sat, 18 Jan 2014 01:09:22 GMT
Sean, Walrus,

Great catch.  I think this is a bug in the code (see below for a comparison
of the current vs the correct code).  Also, here's another link describing
the derivation: http://cbcb.umd.edu/~hcorrada/PML/homeworks/HW04_solutions.pdf
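
For reference, the short version of the derivation (using the convention
that the penalty is lambda times the squared L2 norm of the weights) is:

  J(w) = L(w) + \lambda \|w\|_2^2
  \nabla J(w) = \nabla L(w) + 2 \lambda w
  w_{t+1} = w_t - \eta_t \nabla J(w_t) = (1 - 2 \eta_t \lambda) w_t - \eta_t \nabla L(w_t)

i.e. shrink the old weights, then subtract the gradient step -- which is the
form of the *CORRECT* line below.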

-Ameet

*CURRENT*
newWeights = weightsOld.sub(normGradient).div(2.0 * thisIterStepSize * regParam + 1.0)

*CORRECT*
newWeights = weightsOld.mul(1.0 - 2.0 * thisIterStepSize * regParam).sub(normGradient)
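
For a quick numerical sanity check that the two lines really do disagree,
here is a small self-contained Scala snippet (plain doubles, no Spark; the
values are arbitrary, and normGradient is assumed to already include the
step size, as it does in the snippet above):

  // Quick check that the current and corrected update lines give different results.
  object L2UpdateCheck {
    def main(args: Array[String]): Unit = {
      val weightOld        = 1.0
      val normGradient     = 0.1   // assumed to already include the step size
      val thisIterStepSize = 0.5
      val regParam         = 0.1

      // current line:   (w - g) / (2 * stepSize * regParam + 1)
      val current = (weightOld - normGradient) /
        (2.0 * thisIterStepSize * regParam + 1.0)

      // corrected line: w * (1 - 2 * stepSize * regParam) - g
      val correct = weightOld * (1.0 - 2.0 * thisIterStepSize * regParam) -
        normGradient

      println(s"current = $current, correct = $correct")
      // prints roughly: current = 0.8181..., correct = 0.8
    }
  }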



On Thu, Jan 9, 2014 at 11:45 AM, Sean Owen <srowen@gmail.com> wrote:

> Yes, the regularization term just adds a bunch of (theta_i)^2 terms.
> The partial derivative with respect to theta_i is simply 2*theta_i,
> since the partial derivatives of the other regularization terms w.r.t.
> theta_i are all 0. The
> regularization term just adds the weight vector itself to the gradient
> -- simples.
>
> ... give or take a factor of 2. To be fair there is minor variation in
> convention here; some put a factor of 1/2 in front of the L2
> regularization term to absorb the 2 in the partial derivatives, for
> tidiness. It doesn't matter in the sense that it's the same as using a
> lambda half as large, but then again, that does matter if you're
> trying to make apples-to-apples comparisons with another
> implementation.
>
> See about slide 20 here for some clear equations:
>
> http://people.cs.umass.edu/~sheldon/teaching/2012fa/ml/files/lec7-annotated.pdf
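
(In symbols, the two conventions being compared above are

  J(w) = L(w) + \lambda \|w\|_2^2          with gradient  \nabla L(w) + 2 \lambda w
  J(w) = L(w) + (\lambda / 2) \|w\|_2^2    with gradient  \nabla L(w) + \lambda w

so they differ only by a factor-of-two rescaling of lambda -- the thing to
watch for when comparing regParam values across implementations.)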
>
> And now I have basically the same question. I'm not sure I get how the
> code in Updater implements L2 regularization. I see the
> weights-minus-gradient part, but the division by the scalar doesn't
> immediately look right. It looks like the shrinkage term, but then
> there should be a minus sign in there, and it ought to be a multiplier on
> the old weights only?
>
> Heh, if it's a slightly different definition, it would really make
> Walrus's point!
>
>
> On Thu, Jan 9, 2014 at 7:10 PM, Evan R. Sparks <evan.sparks@gmail.com>
> wrote:
> > Hi,
> >
> > The L2 update rule is derived from the derivative of the loss function
> > with respect to the model weights - an L2 regularized loss function
> > contains an additional additive term involving the weights. This paper
> > provides some useful mathematical background:
> > http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.58.7377
> >
> > The code that computes the new L2 weight is here:
> >
> > https://github.com/apache/incubator-spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/optimization/Updater.scala#L90
> >
> > The compute function calculates the new weights based on the current
> > weights and the gradient computed at each step. Contrast it with the code
> > in the SimpleUpdater class to get a sense for how the regularization
> > parameter is incorporated - it's fairly simple.
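
For readers following along without the source open, the contrast boils down
to something like the toy Scala sketch below. The trait and object names are
made up for illustration and the signatures are simplified (they are not
MLlib's actual API), so treat this only as a sketch of where regParam enters:

  // Toy sketch only: names and signatures are illustrative, not MLlib's API.
  trait ToyUpdater {
    def compute(weightsOld: Array[Double], gradient: Array[Double],
                stepSize: Double, regParam: Double): Array[Double]
  }

  // SimpleUpdater-style step: regParam is ignored entirely.
  object ToySimpleUpdater extends ToyUpdater {
    def compute(w: Array[Double], g: Array[Double],
                stepSize: Double, regParam: Double): Array[Double] =
      w.zip(g).map { case (wi, gi) => wi - stepSize * gi }
  }

  // L2-regularized step: shrink the old weights, then subtract the scaled gradient.
  object ToyL2Updater extends ToyUpdater {
    def compute(w: Array[Double], g: Array[Double],
                stepSize: Double, regParam: Double): Array[Double] = {
      val shrink = 1.0 - 2.0 * stepSize * regParam
      w.zip(g).map { case (wi, gi) => shrink * wi - stepSize * gi }
    }
  }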
> >
> > In general, though, I agree it makes sense to include a discussion of the
> > algorithm and a reference to the specific version we implement in the
> > scaladoc.
> >
> > - Evan
> >
> >
> > On Thu, Jan 9, 2014 at 10:49 AM, Walrus theCat <walrusthecat@gmail.com>
> > wrote:
> >>
> >> No -- I'm not, and I appreciate the comment.  What I'm looking for is a
> >> specific mathematical formula that I can map to the source code.
> >>
> >> Personally, specifically, I'd like to see how the loss function gets
> >> embedded into the w (gradient), in the case of the regularized and
> >> unregularized operation.
> >>
> >> Looking through the source, the "loss history" makes sense to me, but I
> >> can't see how that translates into the effect on the gradient.
> >>
> >>
> >> On Thu, Jan 9, 2014 at 10:39 AM, Sean Owen <srowen@gmail.com> wrote:
> >>>
> >>> L2 regularization just means "regularizing by penalizing parameters
> >>> whose L2 norm is large", and the L2 penalty is just the squared length
> >>> of the weight vector. It's not something you would write an ML paper on
> >>> any more than what the vector dot product is. Are you asking something
> >>> else?
> >>>
> >>> On Thu, Jan 9, 2014 at 6:19 PM, Walrus theCat <walrusthecat@gmail.com>
> >>> wrote:
> >>> > Thanks Christopher,
> >>> >
> >>> > I wanted to know if there was a specific paper this particular
> >>> > codebase was based on.  For instance, Weka cites papers in their
> >>> > documentation.
> >>> >
> >>> >
> >>> > On Wed, Jan 8, 2014 at 7:10 PM, Christopher Nguyen <ctn@adatao.com>
> >>> > wrote:
> >>> >>
> >>> >> Walrus, given the question, this may be a good place for you to
> >>> >> start.  There's some good discussion there as well as links to
> >>> >> papers.
> >>> >>
> >>> >> http://www.quora.com/Machine-Learning/What-is-the-difference-between-L1-and-L2-regularization
> >>> >>
> >>> >> Sent while mobile. Pls excuse typos etc.
> >>> >>
> >>> >> On Jan 8, 2014 2:24 PM, "Walrus theCat" <walrusthecat@gmail.com>
> >>> >> wrote:
> >>> >>>
> >>> >>> Hi,
> >>> >>>
> >>> >>> Can someone point me to the paper that algorithm is based on?
> >>> >>>
> >>> >>> Thanks
> >>> >
> >>> >
> >>
> >>
> >
>
