spark-user mailing list archives

From Xiangrui Meng <men...@gmail.com>
Subject Re: possible bug in Spark's ALS implementation...
Date Wed, 12 Mar 2014 01:38:27 GMT
Hi Michael,

I can help check the current implementation. Would you please go to
https://spark-project.atlassian.net/browse/SPARK and create a ticket
about this issue with component "MLlib"? Thanks!

Best,
Xiangrui

On Tue, Mar 11, 2014 at 3:18 PM, Michael Allman <msa@allman.ms> wrote:
> Hi,
>
> I'm implementing a recommender based on the algorithm described in
> http://www2.research.att.com/~yifanhu/PUB/cf.pdf. This algorithm forms the
> basis for Spark's ALS implementation for data sets with implicit feedback.
> The data set I'm working with is proprietary and I cannot share it, however
> I can say that it's based on the same kind of data in the paper---relative
> viewing time of videos. (Specifically, the "rating" for each video is
> defined as total viewing time across all visitors divided by video
> duration).
>
> I'm seeing counterintuitive, sometimes nonsensical recommendations. For
> comparison, I've run the training data through Oryx's in-VM implementation
> of implicit ALS with the same parameters. Oryx uses the same algorithm.
> (Source in this file:
> https://github.com/cloudera/oryx/blob/master/als-common/src/main/java/com/cloudera/oryx/als/common/factorizer/als/AlternatingLeastSquares.java)
>
> The recommendations made by the two systems are very different from one
> another, more so than I think could be explained by differences in initial
> state. The recommendations made by the Oryx models look much better,
> especially as I increase the number of latent factors and the iterations.
> The Spark models' recommendations don't improve with increases in either
> latent factors or iterations. Sometimes, they get worse.
>
> Because of the (understandably) highly-optimized and terse style of Spark's
> ALS implementation, I've had a very hard time following it well enough to
> debug the issue definitively. However, I have found a section of code that
> looks incorrect. As described in the paper, part of the implicit ALS
> algorithm involves computing a matrix product YtCuY (equation 4 in the
> paper). To optimize this computation, this expression is rewritten as YtY +
> Yt(Cu - I)Y. I believe that's what should be happening here:
>
> https://github.com/apache/incubator-spark/blob/v0.9.0-incubating/mllib/src/main/scala/org/apache/spark/mllib/recommendation/ALS.scala#L376
>
> However, it looks like this code is in fact computing YtY + YtY(Cu - I),
> which is the same as YtYCu. If so, that's a bug. Can someone familiar with
> this code evaluate my claim?
>
> Cheers,
>
> Michael
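
For reference, the rewrite discussed in the quoted message, YtCuY = YtY + Yt(Cu - I)Y, can be checked numerically on a toy example. The sketch below is illustrative only and is not taken from Spark or Oryx: the function names, the 3x2 factor matrix Y, and the confidence vector c (the diagonal of Cu for one user) are all made-up assumptions. It compares the direct product with the split form, where YtY is computed once and only items with confidence different from 1 contribute a correction term.

```python
# Toy check of YtCuY == YtY + Yt(Cu - I)Y for one user.
# All names and values here are hypothetical, chosen only for illustration.

def outer(u, v):
    """Outer product of two vectors, as a list of rows."""
    return [[a * b for b in v] for a in u]

def add_scaled(acc, m, scale):
    """Return acc + scale * m for equally sized matrices (new list, no mutation)."""
    return [[x + scale * y for x, y in zip(ra, rm)] for ra, rm in zip(acc, m)]

def zeros(f):
    return [[0] * f for _ in range(f)]

def gram(Y):
    """YtY: sum of y_i y_i^T over all item factor rows y_i."""
    total = zeros(len(Y[0]))
    for y in Y:
        total = add_scaled(total, outer(y, y), 1)
    return total

def yt_cu_y_direct(Y, c):
    """Yt Cu Y computed directly: sum of c_i * y_i y_i^T over all items."""
    total = zeros(len(Y[0]))
    for y, ci in zip(Y, c):
        total = add_scaled(total, outer(y, y), ci)
    return total

def yt_cu_y_split(Y, c, yty):
    """YtY + Yt(Cu - I)Y: reuse the shared YtY, correct only where c_i != 1."""
    total = yty
    for y, ci in zip(Y, c):
        if ci != 1:
            total = add_scaled(total, outer(y, y), ci - 1)
    return total

Y = [[1, 2], [3, 4], [5, 6]]   # 3 items, 2 latent factors (made-up values)
c = [1, 3, 1]                  # only the second item has confidence above 1

yty = gram(Y)
direct = yt_cu_y_direct(Y, c)
split = yt_cu_y_split(Y, c, yty)
print(direct == split)
```

The point of the split form is that YtY does not depend on the user, so it can be computed once per iteration, and the per-user correction only touches the items that user has interacted with.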
