spark-user mailing list archives

From: Michael Allman <...@allman.ms>
Subject: possible bug in Spark's ALS implementation...
Date: Tue, 11 Mar 2014 22:18:43 GMT
Hi,

I'm implementing a recommender based on the algorithm described in 
http://www2.research.att.com/~yifanhu/PUB/cf.pdf. This algorithm forms the 
basis for Spark's ALS implementation for data sets with implicit features. 
The data set I'm working with is proprietary and I cannot share it, 
however I can say that it's based on the same kind of data in the 
paper---relative viewing time of videos. (Specifically, the "rating" for 
each video is defined as total viewing time across all visitors divided by 
video duration).
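
To make that definition concrete, here's a literal reading of it in a few 
lines of Scala (the data and names are made up for illustration; this is 
not my actual pipeline):

    // Hypothetical per-visit watch logs: (videoId, secondsWatched)
    val watches = Seq((1, 120.0), (1, 60.0), (2, 300.0))
    // Video durations in seconds
    val durations = Map(1 -> 90.0, 2 -> 600.0)

    // rating(video) = total viewing time across all visitors / duration
    val ratings = watches.groupBy(_._1).map { case (video, vs) =>
      video -> vs.map(_._2).sum / durations(video)
    }
    // ratings: Map(1 -> 2.0, 2 -> 0.5)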

I'm seeing counterintuitive, sometimes nonsensical recommendations. For 
comparison, I've run the training data through Oryx's in-VM implementation 
of implicit ALS with the same parameters. Oryx uses the same algorithm. 
(Source in this file: 
https://github.com/cloudera/oryx/blob/master/als-common/src/main/java/com/cloudera/oryx/als/common/factorizer/als/AlternatingLeastSquares.java)

The two systems' recommendations are very different, more so than I think 
can be explained by differences in initial state. The recommendations made 
by the Oryx models look much
better, especially as I increase the number of latent factors and the 
iterations. The Spark models' recommendations don't improve with increases 
in either latent factors or iterations. Sometimes, they get worse.
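
For reference, on the Spark side I'm invoking the implicit trainer roughly 
like this (the parameter values below are illustrative placeholders, not 
my actual settings):

    import org.apache.spark.SparkContext
    import org.apache.spark.mllib.recommendation.{ALS, Rating}

    val sc = new SparkContext("local", "als-repro")
    // Tiny stand-in data set; the real one is proprietary
    val ratings = sc.parallelize(Seq(
      Rating(1, 1, 2.0), Rating(1, 2, 0.5), Rating(2, 1, 1.0)))

    // rank = 20, iterations = 10, lambda = 0.01, alpha = 40.0 (placeholders)
    val model = ALS.trainImplicit(ratings, 20, 10, 0.01, 40.0)
    val score = model.predict(2, 2)  // predicted preference for (user 2, video 2)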

Because of the (understandably) highly optimized and terse style of 
Spark's ALS implementation, I've had a very hard time following it well 
enough to debug the issue definitively. However, I have found a section of 
code that looks incorrect. As described in the paper, part of the implicit 
ALS algorithm involves computing a matrix product YtCuY (equation 4 in the 
paper). To optimize this computation, this expression is rewritten as YtY 
+ Yt(Cu - I)Y. I believe that's what should be happening here:

https://github.com/apache/incubator-spark/blob/v0.9.0-incubating/mllib/src/main/scala/org/apache/spark/mllib/recommendation/ALS.scala#L376
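
(For reference, the rewrite follows from Cu = I + (Cu - I): YtCuY = 
Yt(I + (Cu - I))Y = YtY + Yt(Cu - I)Y. The point of the optimization is 
that YtY is user-independent and can be computed once, while Yt(Cu - I)Y 
only involves the items the user has actually interacted with.)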

However, it looks like this code is in fact computing YtY + YtY(Cu - I), 
which is the same as YtYCu. If so, that's a bug. Can someone familiar with 
this code evaluate my claim?
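
To make the discrepancy concrete, here's a tiny self-contained check in 
Scala (plain arrays, using a square case with f = n = 2 so that both 
expressions are well-formed):

    // Naive dense matrix multiply and add, just for this check
    def mm(a: Array[Array[Double]], b: Array[Array[Double]]) =
      Array.tabulate(a.length, b(0).length) { (i, j) =>
        (0 until b.length).map(k => a(i)(k) * b(k)(j)).sum
      }
    def add(a: Array[Array[Double]], b: Array[Array[Double]]) =
      Array.tabulate(a.length, a(0).length)((i, j) => a(i)(j) + b(i)(j))

    val y   = Array(Array(1.0, 2.0), Array(3.0, 4.0))  // item factors Y
    val yt  = y.transpose                              // Yt
    val cu  = Array(Array(2.0, 0.0), Array(0.0, 3.0))  // Cu (diagonal)
    val cmi = Array(Array(1.0, 0.0), Array(0.0, 2.0))  // Cu - I

    mm(mm(yt, cu), y)                  // YtCuY           -> [[29,40],[40,56]]
    add(mm(yt, y), mm(mm(yt, cmi), y)) // YtY + Yt(Cu-I)Y -> [[29,40],[40,56]]
    add(mm(yt, y), mm(mm(yt, y), cmi)) // YtY + YtY(Cu-I) -> [[20,42],[28,60]]

The first two agree; the third does not (it isn't even symmetric), which 
matches the algebra: YtY + YtY(Cu - I) = YtY(I + Cu - I) = YtYCu.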

Cheers,

Michael
