Sean, Walrus,

Great catch. I think this is a bug in the code (see below for a comparison of the current vs. the correct code). Also, here's another link describing the derivation.

**CURRENT**


newWeights = weightsOld.sub(normGradient).div(2.0 * thisIterStepSize * regParam + 1.0)

**CORRECT**

newWeights = weightsOld.mul(1.0 - 2.0 * thisIterStepSize * regParam).sub(normGradient)
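A sketch of where the corrected line comes from (assuming normGradient is the already step-size-scaled loss gradient, i.e. \eta \nabla L(w_t), with \eta = thisIterStepSize and \lambda = regParam):

    f(w) = L(w) + \lambda \|w\|_2^2

    w_{t+1} = w_t - \eta (\nabla L(w_t) + 2 \lambda w_t)
            = (1 - 2 \eta \lambda) w_t - \eta \nabla L(w_t)

That is: scale the old weights by (1.0 - 2.0 * thisIterStepSize * regParam), then subtract normGradient, which is exactly the CORRECT line above.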

-Ameet

On Thu, Jan 9, 2014 at 11:45 AM, Sean Owen <srowen@gmail.com> wrote:

Yes, the regularization term just adds a bunch of (theta_i)^2 terms.

The partial derivative with respect to theta_i is simply 2*theta_i

since all the other new regularization terms are 0 w.r.t. theta_i. The

regularization term just adds the weight vector itself to the gradient

-- simples.

... give or take a factor of 2. To be fair there is minor variation in

convention here; some put a factor of 1/2 in front of the L2

regularization term to absorb the 2 in the partial derivatives, for

tidiness. It doesn't matter in the sense that it's the same as using a

lambda half as large, but then again, that does matter if you're

trying to make apples-to-apples comparisons with another

implementation.
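In symbols, the two conventions look like this (\lambda is the regularization strength):

    R(\theta) = \lambda \sum_i \theta_i^2              =>   \partial R / \partial \theta_i = 2 \lambda \theta_i

    R(\theta) = (\lambda / 2) \sum_i \theta_i^2        =>   \partial R / \partial \theta_i = \lambda \theta_i

so the two only differ by a factor of two absorbed into lambda.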

See about slide 20 here for some clear equations:

http://people.cs.umass.edu/~sheldon/teaching/2012fa/ml/files/lec7-annotated.pdf

And now I have basically the same question. I'm not sure I get how the code in Updater implements L2 regularization. I see the

weights-minus-gradient part, but the division by the scalar doesn't

look right immediately. It looks like the shrinking term but then

there should be a minus in there, and it ought to be a multiplier on

the old weights only?
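Spelled out (a sketch, with \eta = the step size and \lambda = the regParam), the code as written computes

    w_{t+1} = (w_t - \eta \nabla L(w_t)) / (1 + 2 \eta \lambda)

whereas the shrinkage form would put the multiplier on the old weights only:

    w_{t+1} = (1 - 2 \eta \lambda) w_t - \eta \nabla L(w_t)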

Heh, if it's a slightly different definition, it would really make

Walrus's point!

On Thu, Jan 9, 2014 at 7:10 PM, Evan R. Sparks <evan.sparks@gmail.com> wrote:

> Hi,

>

> The L2 update rule is derived from the derivative of the loss function with

> respect to the model weights - an L2 regularized loss function contains an

> additional additive term involving the weights. This paper provides some

> useful mathematical background:

> http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.58.7377

>

> The code that computes the new L2 weight is here:

> https://github.com/apache/incubator-spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/optimization/Updater.scala#L90

>

> The compute function calculates the new weights based on the current weights

> and the gradient computed at each step. Contrast it with the code in the

> SimpleUpdater class to get a sense for how the regularization parameter is

> incorporated - it's fairly simple.
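As a rough illustration of that contrast (a hypothetical, simplified sketch on plain arrays; the actual Updater.scala code operates on matrix/vector types and also returns a regularization value):

    object UpdaterSketch {

      // SimpleUpdater-style step: plain SGD; the regularization
      // parameter plays no role here.
      def simpleUpdate(w: Array[Double], grad: Array[Double],
                       step: Double): Array[Double] =
        w.zip(grad).map { case (wi, gi) => wi - step * gi }

      // L2-regularized step: shrink the old weights toward zero,
      // then take the gradient step.
      def l2Update(w: Array[Double], grad: Array[Double],
                   step: Double, regParam: Double): Array[Double] =
        w.zip(grad).map { case (wi, gi) =>
          (1.0 - 2.0 * step * regParam) * wi - step * gi
        }

      def main(args: Array[String]): Unit = {
        val w = Array(0.5, -1.0)
        val g = Array(0.1, 0.2)
        println(simpleUpdate(w, g, 0.1).mkString(", "))
        println(l2Update(w, g, 0.1, 0.01).mkString(", "))
      }
    }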

>

> In general, though, I agree it makes sense to include a discussion of = the

> algorithm and a reference to the specific version we implement in the

> scaladoc.

>

> - Evan

>

>

> On Thu, Jan 9, 2014 at 10:49 AM, Walrus theCat <walrusthecat@gmail.com>

> wrote:

>>

>> No -- I'm not, and I appreciate the comment. What I'm looking for is a

>> specific mathematical formula that I can map to the source code.

>>

>> Personally, specifically, I'd like to see how the loss function gets

>> embedded into the w (gradient), in the case of the regularized and

>> unregularized operation.

>>

>> Looking through the source, the "loss history" makes sense to me, but I

>> can't see how that translates into the effect on the gradient.

>>

>>

>> On Thu, Jan 9, 2014 at 10:39 AM, Sean Owen <srowen@gmail.com> wrote:

>>>

>>> L2 regularization just means "regularizing by penalizing parameters

>>> whose L2 norm is large", and L2 norm just means squared length. It's

>>> not something you would write an ML paper on any more than what the

>>> vector dot product is. Are you asking something else?
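(For reference, the quantity being penalized is the squared L2 norm:

    \|\theta\|_2^2 = \sum_i \theta_i^2,   so an L2 penalty term is   \lambda \|\theta\|_2^2.)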

>>>

>>> On Thu, Jan 9, 2014 at 6:19 PM, Walrus theCat <walrusthecat@gmail.com>

>>> wrote:

>>> > Thanks Christopher,

>>> >

>>> > I wanted to know if there was a specific paper this particular codebase

>>> > was

>>> > based on. For instance, Weka cites papers in their documentation.

>>> >

>>> >

>>> > On Wed, Jan 8, 2014 at 7:10 PM, Christopher Nguyen <ctn@adatao.com>

>>> > wrote:

>>> >>

>>> >> Walrus, given the question, this may be a good place for you to start.

>>> >> There's some good discussion there as well as links to papers.

>>> >>

>>> >>

>>> >>

>>> >> http://www.quora.com/Machine-Learning/What-is-the-difference-between-L1-and-L2-regularization

>>> >>

>>> >> Sent while mobile. Pls excuse typos etc.

>>> >>

>>> >> On Jan 8, 2014 2:24 PM, "Walrus theCat" <walrusthecat@gmail.com>

>>> >> wrote:

>>> >>>

>>> >>> Hi,

>>> >>>

>>> >>> Can someone point me to the paper that algorithm = is based on?

>>> >>>

>>> >>> Thanks

>>> >

>>> >

>>

>>

>