Sean, Walrus,

Great catch.  I think this is a bug in the code (see below for a comparison of the current vs the correct code).  Also, here's another link describing the derivation.

-Ameet

CURRENT
newWeights = weightsOld.sub(normGradient).div(2.0 * thisIterStepSize * regParam + 1.0)

CORRECT
newWeights = weightsOld.mul(1.0 - 2.0 * thisIterStepSize * regParam).sub(normGradient)
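
For context, here is a rough sketch of how the corrected line would sit inside a
SquaredL2Updater-style compute method. I'm assuming the surrounding pieces (jblas
DoubleMatrix arguments, the 1/sqrt(iter) step-size decay, and the returned
regularization value) look like the current code; treat it as an illustration,
not a drop-in patch:

import org.jblas.DoubleMatrix

object L2UpdateSketch {
  def compute(
      weightsOld: DoubleMatrix,
      gradient: DoubleMatrix,
      stepSize: Double,
      iter: Int,
      regParam: Double): (DoubleMatrix, Double) = {
    // Step size decays as 1 / sqrt(iteration), as in the current code.
    val thisIterStepSize = stepSize / math.sqrt(iter)
    val normGradient = gradient.mul(thisIterStepSize)
    // Shrink the old weights by (1 - 2 * stepSize_t * regParam), then subtract the scaled gradient.
    val newWeights =
      weightsOld.mul(1.0 - 2.0 * thisIterStepSize * regParam).sub(normGradient)
    // Return the new weights along with the L2 regularization value regParam * ||w||^2.
    (newWeights, math.pow(newWeights.norm2, 2.0) * regParam)
  }
}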
 


On Thu, Jan 9, 2014 at 11:45 AM, Sean Owen <srowen@gmail.com> wrote:
Yes, the regularization term just adds a bunch of (theta_i)^2 terms.
The partial derivative with respect to theta_i is simply 2*theta_i,
since the other regularization terms have zero derivative with respect
to theta_i. The regularization term just adds the weight vector itself
to the gradient -- simples.

... give or take a factor of 2. To be fair, there is some minor
variation in convention here; some put a factor of 1/2 in front of the
L2 regularization term to absorb the 2 in the partial derivatives, for
tidiness. It doesn't matter in the sense that it's the same as using a
lambda half as large, but then again, it does matter if you're trying
to make apples-to-apples comparisons with another implementation.
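
To make the factor-of-2 point concrete, here is the objective and its
gradient under both conventions (just a sketch of the standard algebra,
with lambda playing the role of regParam):

J(w) = L(w) + \lambda \lVert w \rVert_2^2
  \quad\Longrightarrow\quad
\nabla J(w) = \nabla L(w) + 2 \lambda w

J_{1/2}(w) = L(w) + \tfrac{\lambda}{2} \lVert w \rVert_2^2
  \quad\Longrightarrow\quad
\nabla J_{1/2}(w) = \nabla L(w) + \lambda w

The two coincide when the lambda in the second form is twice the lambda
in the first.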

See about slide 20 here for some clear equations:
http://people.cs.umass.edu/~sheldon/teaching/2012fa/ml/files/lec7-annotated.pdf

And now I have basically the same question. I'm not sure I get how the
code in Updater implements L2 regularization. I see the
weights-minus-gradient part, but the division by the scalar doesn't
immediately look right. It looks like it's meant to be the shrinkage
term, but then there should be a minus sign in there, and it ought to
be a multiplier on the old weights only?
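
For reference, the plain gradient step under the first convention above
(eta is the per-iteration step size; a sketch of the standard
derivation, not a claim about what the code intends):

w_{t+1} = w_t - \eta \bigl( \nabla L(w_t) + 2 \lambda w_t \bigr)
        = (1 - 2 \eta \lambda)\, w_t - \eta\, \nabla L(w_t)

i.e. a (1 - 2*eta*lambda) multiplier on the old weights, minus the
scaled gradient.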

Heh, if it's a slightly different definition, it would really make
Walrus's point!


On Thu, Jan 9, 2014 at 7:10 PM, Evan R. Sparks <evan.sparks@gmail.com> wrote:
> Hi,
>
> The L2 update rule is derived from the derivative of the loss function with
> respect to the model weights - an L2-regularized loss function contains an
> additional additive term involving the squared norm of the weights. This paper provides some
> useful mathematical background:
> http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.58.7377
>
> The code that computes the new L2 weight is here:
> https://github.com/apache/incubator-spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/optimization/Updater.scala#L90
>
> The compute function calculates the new weights based on the current weights
> and the gradient as computed at each step. Contrast it with the code in the
> SimpleUpdater class to get a sense for how the regularization parameter is
> incorporated - it's fairly simple.
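>
> Roughly, the SimpleUpdater (unregularized) step is just "weights minus
> scaled gradient" -- something like this sketch (jblas types as in
> Updater.scala; the exact signature in the repo may differ):
>
> import org.jblas.DoubleMatrix
>
> object SimpleUpdateSketch {
>   // Unregularized SGD step: w_new = w_old - eta_t * gradient,
>   // where eta_t decays as stepSize / sqrt(iter).
>   def compute(
>       weightsOld: DoubleMatrix,
>       gradient: DoubleMatrix,
>       stepSize: Double,
>       iter: Int): DoubleMatrix = {
>     val thisIterStepSize = stepSize / math.sqrt(iter)
>     weightsOld.sub(gradient.mul(thisIterStepSize))
>   }
> }
>
> The L2 version additionally shrinks the weights toward zero according
> to regParam.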
>
> In general, though, I agree it makes sense to include a discussion of the
> algorithm and a reference to the specific version we implement in the
> scaladoc.
>
> - Evan
>
>
> On Thu, Jan 9, 2014 at 10:49 AM, Walrus theCat <walrusthecat@gmail.com>
> wrote:
>>
>> No -- I'm not, and I appreciate the comment.  What I'm looking for is a
>> specific mathematical formula that I can map to the source code.
>>
>> Specifically, I'd like to see how the loss function gets embedded
>> into the weight (gradient) update, in both the regularized and
>> unregularized cases.
>>
>> Looking through the source, the "loss history" makes sense to me, but I
>> can't see how that translates into the effect on the gradient.
>>
>>
>> On Thu, Jan 9, 2014 at 10:39 AM, Sean Owen <srowen@gmail.com> wrote:
>>>
>>> L2 regularization just means "regularizing by penalizing parameters
>>> whose L2 norm is large", and the L2 norm is just the Euclidean length
>>> (the penalty term is usually its square). It's not something you
>>> would write an ML paper on any more than you would on the vector dot
>>> product. Are you asking something else?
>>>
>>> On Thu, Jan 9, 2014 at 6:19 PM, Walrus theCat <walrusthecat@gmail.com>
>>> wrote:
>>> > Thanks Christopher,
>>> >
>>> > I wanted to know if there was a specific paper this particular codebase
>>> > was
>>> > based on.  For instance, Weka cites papers in its documentation.
>>> >
>>> >
>>> > On Wed, Jan 8, 2014 at 7:10 PM, Christopher Nguyen <ctn@adatao.com>
>>> > wrote:
>>> >>
>>> >> Walrus, given the question, this may be a good place for you to start.
>>> >> There's some good discussion there as well as links to papers.
>>> >>
>>> >>
>>> >>
>>> >> http://www.quora.com/Machine-Learning/What-is-the-difference-between-L1-and-L2-regularization
>>> >>
>>> >> Sent while mobile. Pls excuse typos etc.
>>> >>
>>> >> On Jan 8, 2014 2:24 PM, "Walrus theCat" <walrusthecat@gmail.com>
>>> >> wrote:
>>> >>>
>>> >>> Hi,
>>> >>>
>>> >>> Can someone point me to the paper that algorithm is based on?
>>> >>>
>>> >>> Thanks
>>> >
>>> >
>>
>>
>