spark-user mailing list archives

From Sebastian Schelter <...@apache.org>
Subject Re: possible bug in Spark's ALS implementation...
Date Wed, 12 Mar 2014 07:40:17 GMT
The Mahout implementation is just a straightforward port of the paper. 
No changes have been made.

On 03/12/2014 08:36 AM, Nick Pentreath wrote:
> It would be helpful to know what parameter inputs you are using.
>
> If the regularization schemes differ (by a factor of alpha, which can
> often be quite high), the same parameter settings could give very
> different results. A higher lambda would be required with Spark's
> version to be comparable.
>
> When I submitted the PR for this, I verified (on ml-100k, ml-1m and ml-10m
> data) that this version gives the same RMSE as Mahout's implicit model, as
> well as a separate Spark version that I wrote that was a from-scratch port
> of the Mahout algorithm (though I didn't compare vs Myrrix/Oryx). I'm
> fairly confident things are correct but if there is a bug let's definitely
> find and fix it!
>
> @Sean, would it be a good idea to look at changing the regularization in
> Spark's ALS to alpha * lambda? What is the thinking behind this? If I
> recall correctly, the Mahout version added something like (# ratings *
> lambda) as regularization in each factor update (for explicit), but for
> implicit it was just lambda (I may be wrong here).
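(For concreteness, a minimal sketch of where those conventions would
differ, using Breeze and made-up names rather than the actual MLlib
code: each scheme changes only the scalar added to the diagonal of the
per-user normal equations.)

    import breeze.linalg.{DenseMatrix, DenseVector}

    // Solve (Y^T C_u Y + reg * I) x_u = Y^T C_u p(u) for one user's factors.
    // The conventions discussed in this thread differ only in `reg`:
    //   reg = lambda, reg = alpha * lambda, or reg = numRatings * lambda.
    def solveUser(YtCuY: DenseMatrix[Double],
                  YtCuPu: DenseVector[Double],
                  reg: Double): DenseVector[Double] = {
      val rank = YtCuY.rows
      (YtCuY + DenseMatrix.eye[Double](rank) * reg) \ YtCuPu
    }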
>
>
>
> On Wed, Mar 12, 2014 at 4:57 AM, Xiangrui Meng <mengxr@gmail.com> wrote:
>
>> Line 376 should be correct, as it is computing \sum_i (c_i - 1) x_i
>> x_i^T = \sum_i (alpha * r_i) x_i x_i^T. Are you computing any metrics
>> to tell which recommendations are better? -Xiangrui
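(To illustrate that sum -- a sketch only, assuming Breeze, not the
packed-array code actually on line 376: since c_i = 1 + alpha * r_i,
the weight c_i - 1 equals alpha * r_i, so items the user never touched
contribute nothing.)

    import breeze.linalg.{DenseMatrix, DenseVector}

    // Accumulate Y^T (C_u - I) Y over only the items user u has rated.
    def ytCuMinusIY(factors: Seq[DenseVector[Double]],  // y_i of rated items
                    ratings: Seq[Double],               // matching r_i values
                    alpha: Double,
                    rank: Int): DenseMatrix[Double] = {
      val acc = DenseMatrix.zeros[Double](rank, rank)
      for ((y, r) <- factors.zip(ratings))
        acc += (y * y.t) * (alpha * r)  // (c_i - 1) * y_i y_i^T
      acc
    }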
>>
>> On Tue, Mar 11, 2014 at 6:38 PM, Xiangrui Meng <mengxr@gmail.com> wrote:
>>> Hi Michael,
>>>
>>> I can help check the current implementation. Would you please go to
>>> https://spark-project.atlassian.net/browse/SPARK and create a ticket
>>> about this issue with component "MLlib"? Thanks!
>>>
>>> Best,
>>> Xiangrui
>>>
>>> On Tue, Mar 11, 2014 at 3:18 PM, Michael Allman <msa@allman.ms> wrote:
>>>> Hi,
>>>>
>>>> I'm implementing a recommender based on the algorithm described in
>>>> http://www2.research.att.com/~yifanhu/PUB/cf.pdf. This algorithm forms
>>>> the basis for Spark's ALS implementation for data sets with implicit
>>>> features. The data set I'm working with is proprietary and I cannot
>>>> share it; however, I can say that it's based on the same kind of data
>>>> as in the paper---relative viewing time of videos. (Specifically, the
>>>> "rating" for each video is defined as total viewing time across all
>>>> visitors divided by video duration.)
>>>>
>>>> I'm seeing counterintuitive, sometimes nonsensical recommendations. For
>>>> comparison, I've run the training data through Oryx's in-VM
>>>> implementation of implicit ALS with the same parameters. Oryx uses the
>>>> same algorithm. (Source in this file:
>>>> https://github.com/cloudera/oryx/blob/master/als-common/src/main/java/com/cloudera/oryx/als/common/factorizer/als/AlternatingLeastSquares.java )
>>>>
>>>> The recommendations made by the two systems are very different from one
>>>> another---more so than I think could be explained by differences in
>>>> initial state. The recommendations made by the Oryx models look much
>>>> better, especially as I increase the number of latent factors and the
>>>> iterations. The Spark models' recommendations don't improve with
>>>> increases in either latent factors or iterations. Sometimes, they get
>>>> worse.
>>>>
>>>> Because of the (understandably) highly optimized and terse style of
>>>> Spark's ALS implementation, I've had a very hard time following it well
>>>> enough to debug the issue definitively. However, I have found a section
>>>> of code that looks incorrect. As described in the paper, part of the
>>>> implicit ALS algorithm involves computing a matrix product YtCuY
>>>> (equation 4 in the paper). To optimize this computation, this
>>>> expression is rewritten as YtY + Yt(Cu - I)Y. I believe that's what
>>>> should be happening here:
>>>>
>>>> https://github.com/apache/incubator-spark/blob/v0.9.0-incubating/mllib/src/main/scala/org/apache/spark/mllib/recommendation/ALS.scala#L376
>>>>
>>>> However, it looks like this code is in fact computing YtY + YtY(Cu -
>>>> I), which is the same as YtYCu. If so, that's a bug. Can someone
>>>> familiar with this code evaluate my claim?
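(For reference, the rewrite Michael describes is the identity
YtCuY = YtY + Yt(Cu - I)Y: YtY can be precomputed once per iteration,
while the correction term touches only the items a user has actually
rated. A standalone numerical check of the identity, sketched with
Breeze:)

    import breeze.linalg.{DenseMatrix, DenseVector, diag, max}
    import breeze.numerics.abs

    val n = 5        // items
    val rank = 3     // latent factors
    val alpha = 40.0
    val Y = DenseMatrix.rand(n, rank)
    val r = DenseVector.rand(n)
    val Cu = diag(r.map(ri => 1.0 + alpha * ri))  // c_i = 1 + alpha * r_i

    val direct    = Y.t * Cu * Y    // Y^T C_u Y, computed directly
    val rewritten = Y.t * Y + Y.t * (Cu - DenseMatrix.eye[Double](n)) * Y

    println(max(abs(direct - rewritten)))  // ~1e-15: the two agree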
>>>>
>>>> Cheers,
>>>>
>>>> Michael
>>
>

