# spark-user mailing list archives

From Nick Pentreath <nick.pentre...@gmail.com>
Subject Re: possible bug in Spark's ALS implementation...
Date Wed, 12 Mar 2014 07:36:59 GMT
It would be helpful to know what parameter inputs you are using.

If the regularization schemes are different (by a factor of alpha, which
can often be quite high) this will mean that the same parameter settings
could give very different results. A higher lambda would be required with
Spark's version to be comparable.
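
To make the scaling point concrete, here is a minimal sketch of a single implicit-feedback user update solved twice, once with regularization lambda and once with alpha * lambda. The item factors, ratings, and helper names (`solve2`, `user_factor`) are made up for illustration, and the sketch does not assert which scheme Spark, Mahout, or Oryx actually uses. It only shows that the same lambda gives different factors under the two schemes.

```python
def solve2(A, b):
    """Solve a 2x2 linear system A x = b by Cramer's rule."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [
        (b[0] * A[1][1] - A[0][1] * b[1]) / det,
        (A[0][0] * b[1] - A[1][0] * b[0]) / det,
    ]


def user_factor(Y, r, alpha, reg):
    """One rank-2 user update: solve (Yt Cu Y + reg * I) x = Yt Cu p."""
    A = [[reg, 0.0], [0.0, reg]]
    b = [0.0, 0.0]
    for y, ri in zip(Y, r):
        c = 1.0 + alpha * ri           # confidence c_i = 1 + alpha * r_i
        p = 1.0 if ri > 0 else 0.0     # binary preference
        for i in range(2):
            b[i] += c * p * y[i]
            for j in range(2):
                A[i][j] += c * y[i] * y[j]
    return solve2(A, b)


# Hypothetical inputs: four items with rank-2 factors, one user's ratings.
alpha, lam = 40.0, 0.1
Y = [[0.5, 1.0], [1.5, -0.5], [0.25, 2.0], [-1.0, 0.75]]
r = [0.0, 3.0, 0.0, 1.5]

x_plain = user_factor(Y, r, alpha, reg=lam)           # regularization = lambda
x_scaled = user_factor(Y, r, alpha, reg=alpha * lam)  # regularization = alpha * lambda

print("lambda:        ", x_plain)
print("alpha * lambda:", x_scaled)
```

With alpha = 40, the second solve is regularized 40 times more strongly, so its factor vector is shrunk further towards zero.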

When I submitted the PR for this, I verified (on ml-100k, ml-1m and ml-10m
data) that this version gives the same RMSE as Mahout's implicit model, as
well as a separate Spark version that I wrote that was a from-scratch port
of the Mahout algorithm (though I didn't compare vs Myrrix/Oryx). I'm
fairly confident things are correct but if there is a bug let's definitely
find and fix it!

@Sean, would it be a good idea to look at changing the regularization in
Spark's ALS to alpha * lambda? What is the thinking behind this? If I
recall, the Mahout version added something like (# ratings * lambda) as
regularization in each factor update (for explicit), but for implicit it
was just lambda (I may be wrong here).

On Wed, Mar 12, 2014 at 4:57 AM, Xiangrui Meng <mengxr@gmail.com> wrote:

> Line 376 should be correct as it is computing
> \sum_i (c_i - 1) x_i x_i^T = \sum_i (alpha * r_i) x_i x_i^T.
> Are you computing some metrics to tell which recommendation is better?
> -Xiangrui
>
> On Tue, Mar 11, 2014 at 6:38 PM, Xiangrui Meng <mengxr@gmail.com> wrote:
> > Hi Michael,
> >
> > I can help check the current implementation. Would you please go to
> > https://spark-project.atlassian.net/browse/SPARK and create a ticket
> > about this issue with component "MLlib"? Thanks!
> >
> > Best,
> > Xiangrui
> >
> > On Tue, Mar 11, 2014 at 3:18 PM, Michael Allman <msa@allman.ms> wrote:
> >> Hi,
> >>
> >> I'm implementing a recommender based on the algorithm described in
> >> http://www2.research.att.com/~yifanhu/PUB/cf.pdf. This algorithm forms the
> >> basis for Spark's ALS implementation for data sets with implicit features.
> >> The data set I'm working with is proprietary and I cannot share it; however,
> >> I can say that it's based on the same kind of data as in the paper: relative
> >> viewing time of videos. (Specifically, the "rating" for each video is
> >> defined as total viewing time across all visitors divided by video
> >> duration.)
> >>
> >> I'm seeing counterintuitive, sometimes nonsensical recommendations. For
> >> comparison, I've run the training data through Oryx's in-VM implementation
> >> of implicit ALS with the same parameters. Oryx uses the same algorithm.
> >> (Source in this file:
> >> https://github.com/cloudera/oryx/blob/master/als-common/src/main/java/com/cloudera/oryx/als/common/factorizer/als/AlternatingLeastSquares.java)
> >>
> >> The recommendations made by the two systems are very different from one
> >> another, more so than I think could be explained by differences in initial
> >> state. The recommendations made by the Oryx models look much better,
> >> especially as I increase the number of latent factors and the iterations.
> >> The Spark models' recommendations don't improve with increases in either
> >> latent factors or iterations. Sometimes, they get worse.
> >>
> >> Because of the (understandably) highly-optimized and terse style of Spark's
> >> ALS implementation, I've had a very hard time following it well enough to
> >> debug the issue definitively. However, I have found a section of code that
> >> looks incorrect. As described in the paper, part of the implicit ALS
> >> algorithm involves computing a matrix product YtCuY (equation 4 in the
> >> paper). To optimize this computation, this expression is rewritten as
> >> YtY + Yt(Cu - I)Y. I believe that's what should be happening here:
> >>
> >> https://github.com/apache/incubator-spark/blob/v0.9.0-incubating/mllib/src/main/scala/org/apache/spark/mllib/recommendation/ALS.scala#L376
> >>
> >> However, it looks like this code is in fact computing YtY + YtY(Cu - I),
> >> which is the same as YtYCu. If so, that's a bug. Can someone familiar with
> >> this code evaluate my claim?
> >>
> >> Cheers,
> >>
> >> Michael
>
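
The rewrite discussed in the quoted thread, YtCuY = YtY + Yt(Cu - I)Y, can be checked numerically on a toy example. The sketch below is not taken from ALS.scala; the factors, ratings, and helper names are hypothetical. It computes the direct sum over all items and the rewritten form (YtY once, plus a correction over observed items only, where c_i - 1 = alpha * r_i) and compares them.

```python
def outer(x):
    """Outer product x x^T as a list of rows."""
    return [[a * b for b in x] for a in x]


def add(A, B):
    """Elementwise sum of two equally sized matrices."""
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def scale(s, A):
    """Multiply every entry of A by the scalar s."""
    return [[s * a for a in row] for row in A]


def zeros(k):
    return [[0.0] * k for _ in range(k)]


# Hypothetical inputs: item factors y_i (rows of Y) and one user's ratings.
alpha = 40.0
Y = [[0.5, 1.0], [1.5, -0.5], [0.25, 2.0], [-1.0, 0.75]]
r = [0.0, 3.0, 0.0, 1.5]              # zero means "not observed"
c = [1.0 + alpha * ri for ri in r]    # confidence c_i = 1 + alpha * r_i
k = len(Y[0])

# Direct form: Yt Cu Y = sum over ALL items of c_i * y_i y_i^T.
direct = zeros(k)
for yi, ci in zip(Y, c):
    direct = add(direct, scale(ci, outer(yi)))

# Rewritten form: YtY (shared across users) + sum over OBSERVED items
# of (c_i - 1) * y_i y_i^T, with c_i - 1 = alpha * r_i.
yty = zeros(k)
for yi in Y:
    yty = add(yty, outer(yi))
correction = zeros(k)
for yi, ri in zip(Y, r):
    if ri != 0.0:
        correction = add(correction, scale(alpha * ri, outer(yi)))
rewritten = add(yty, correction)

max_diff = max(
    abs(a - b) for ra, rb in zip(direct, rewritten) for a, b in zip(ra, rb)
)
print("largest entrywise difference:", max_diff)
```

The point of the rewrite is that YtY does not depend on the user, so it can be computed once, while the correction term touches only the items that user has interacted with.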

