spark-user mailing list archives

From Xiangrui Meng <men...@gmail.com>
Subject Re: Implicit matrix factorization returning different results between spark 1.2.0 and 1.3.0
Date Wed, 01 Apr 2015 23:59:44 GMT
Ravi, we just merged https://issues.apache.org/jira/browse/SPARK-6642
and used the same lambda scaling as in 1.2. The change will be
included in Spark 1.3.1, which will be released soon. Thanks for
reporting this issue! -Xiangrui

On Tue, Mar 31, 2015 at 8:53 PM, Xiangrui Meng <mengxr@gmail.com> wrote:
> I created a JIRA for this:
> https://issues.apache.org/jira/browse/SPARK-6637. Since we don't have
> a clear answer about how the scaling should be handled, maybe the best
> solution for now is to switch back to the 1.2 scaling. -Xiangrui
>
> On Tue, Mar 31, 2015 at 2:50 PM, Sean Owen <sowen@cloudera.com> wrote:
>> Ah yeah I take your point. The squared error term is over the whole
>> user-item matrix, technically, in the implicit case. I suppose I am
>> used to assuming that the 0 terms in this matrix are weighted so much
>> less (because alpha is usually large-ish) that they're almost not
>> there, but they are. So I had just used the explicit formulation.
>>
>> I suppose the result is kind of scale invariant, but not exactly. I
>> had not prioritized this property since I had generally built models
>> on the full data set and not a sample, and had assumed that lambda
>> would need to be retuned over time as the input grew anyway.
>>
>> So, basically I don't know anything more than you do, sorry!
>>
>> On Tue, Mar 31, 2015 at 10:41 PM, Xiangrui Meng <mengxr@gmail.com> wrote:
>>> Hey Sean,
>>>
>>> That is true for the explicit model, but not for the implicit one. The
>>> ALS-WR paper doesn't cover the implicit model. In the implicit
>>> formulation, a sub-problem (for v_j) is:
>>>
>>> min_{v_j} \sum_i c_ij (p_ij - u_i^T v_j)^2 + lambda * X * \|v_j\|_2^2
>>>
>>> This is a sum over all i, not just over the users who rated item j. In
>>> this case, if we set X=m_j, the number of observed ratings for item j,
>>> it is not really scale invariant. We have #users user vectors in the
>>> least squares problem but only penalize by lambda * #ratings. I was
>>> suggesting using lambda * m directly for the implicit model to match the
>>> number of vectors in the least squares problem. Well, this is my
>>> theory. I haven't found any published work on it.
>>>
>>> Best,
>>> Xiangrui
>>>
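Setting the gradient of the sub-problem above with respect to v_j to zero gives a closed-form solution, which may make the role of X easier to see. The identity matrix I (rank x rank) is notation added here for illustration; the other symbols are the ones used in the message above.

```latex
\Big(\sum_i c_{ij}\, u_i u_i^T + \lambda X I\Big) v_j = \sum_i c_{ij}\, p_{ij}\, u_i
\quad\Longrightarrow\quad
v_j = \Big(\sum_i c_{ij}\, u_i u_i^T + \lambda X I\Big)^{-1} \sum_i c_{ij}\, p_{ij}\, u_i
```

X enters only through the ridge term lambda * X * I, so for a fixed lambda a larger X pulls v_j more strongly toward zero.
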
>>> On Tue, Mar 31, 2015 at 5:17 AM, Sean Owen <sowen@cloudera.com> wrote:
>>>> I had always understood the formulation to be the first option you
>>>> describe. Lambda is scaled by the number of items the user has rated /
>>>> interacted with. I think the goal is to avoid fitting the tastes of
>>>> prolific users disproportionately just because they have many ratings
>>>> to fit. This is what's described in the ALS-WR paper we link to on the
>>>> Spark web site, in equation 5
>>>> (http://www.grappa.univ-lille3.fr/~mary/cours/stats/centrale/reco/paper/MatrixFactorizationALS.pdf)
>>>>
>>>> I think this also gets you the scale-invariance? For every additional
>>>> rating from user i to product j, you add one new term to the
>>>> squared-error sum, (r_ij - u_i . m_j)^2, but also, you'd increase the
>>>> regularization term by lambda * (|u_i|^2 + |m_j|^2). They are at least
>>>> both increasing about linearly as ratings increase. If the
>>>> regularization term is multiplied by the total number of users and
>>>> products in the model, then it's fixed.
>>>>
>>>> I might misunderstand you and/or be speaking about something slightly
>>>> different when it comes to invariance. But FWIW I had always
>>>> understood the regularization to be multiplied by the number of
>>>> explicit ratings.
>>>>
>>>> On Mon, Mar 30, 2015 at 5:51 PM, Xiangrui Meng <mengxr@gmail.com> wrote:
>>>>> Okay, I didn't realize that I changed the behavior of lambda in 1.3
>>>>> to make it "scale-invariant", but it is worth discussing whether this
>>>>> is a good change. In 1.2, we multiply lambda by the number of ratings in
>>>>> each sub-problem. This makes it "scale-invariant" for explicit
>>>>> feedback. However, in the implicit feedback model, a user's sub-problem
>>>>> contains all item factors. Then the question is whether we should
>>>>> multiply lambda by the number of explicit ratings from this user or by
>>>>> the total number of items. We used the former in 1.2 but changed to
>>>>> the latter in 1.3. So you should try a smaller lambda to get a similar
>>>>> result in 1.3.
>>>>>
>>>>> Sean and Shuo, which approach do you prefer? Do you know any existing
>>>>> work discussing this?
>>>>>
>>>>> Best,
>>>>> Xiangrui
>>>>>
>>>>>
>>>>> On Fri, Mar 27, 2015 at 11:27 AM, Xiangrui Meng <mengxr@gmail.com> wrote:
>>>>>> This sounds like a bug ... Did you try a different lambda? It would be
>>>>>> great if you can share your dataset or re-produce this issue on the
>>>>>> public dataset. Thanks! -Xiangrui
>>>>>>
>>>>>> On Thu, Mar 26, 2015 at 7:56 AM, Ravi Mody <rmody999@gmail.com> wrote:
>>>>>>> After upgrading to 1.3.0, ALS.trainImplicit() has been returning vastly
>>>>>>> smaller factors (and hence scores). For example, the first few product's
>>>>>>> factor values in 1.2.0 are (0.04821, -0.00674, -0.0325). In 1.3.0, the
>>>>>>> first few factor values are (2.535456E-8, 1.690301E-8, 6.99245E-8). This
>>>>>>> difference of several orders of magnitude is consistent throughout both user
>>>>>>> and product. The recommendations from 1.2.0 are subjectively much better
>>>>>>> than in 1.3.0. 1.3.0 trains significantly faster than 1.2.0, and uses less
>>>>>>> memory.
>>>>>>>
>>>>>>> My first thought is that there is too much regularization in the 1.3.0
>>>>>>> results, but I'm using the same lambda parameter value. This is a snippet of
>>>>>>> my scala code:
>>>>>>> .....
>>>>>>> val rank = 75
>>>>>>> val numIterations = 15
>>>>>>> val alpha = 10
>>>>>>> val lambda = 0.01
>>>>>>> val model = ALS.trainImplicit(train_data, rank, numIterations,
>>>>>>> lambda=lambda, alpha=alpha)
>>>>>>> .....
>>>>>>>
>>>>>>> The code and input data are identical across both versions. Did anything
>>>>>>> change between the two versions I'm not aware of? I'd appreciate any help!
>>>>>>>
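
The smaller factors reported above are at least qualitatively consistent with the regularization scaling discussed in this thread. The following is a minimal rank-1 sketch in plain Scala (no Spark), not the MLlib implementation: every factor is a scalar, so the item sub-problem min_v sum_i c_i (p_i - u_i v)^2 + lambda * X * v^2 has a one-line solution. The object name, the helper solveItemFactor, the toy sizes, and the confidence weights c_i = 1 + alpha * p_i are all assumptions made for illustration.

```scala
object LambdaScalingSketch {
  // Closed-form minimizer of sum_i c_i * (p_i - u_i * v)^2 + lambda * x * v^2
  // over a single scalar item factor v (rank-1 case, hypothetical helper).
  def solveItemFactor(u: Seq[Double], c: Seq[Double], p: Seq[Double],
                      lambda: Double, x: Double): Double = {
    val numerator = u.indices.map(i => c(i) * p(i) * u(i)).sum
    val denominator = u.indices.map(i => c(i) * u(i) * u(i)).sum + lambda * x
    numerator / denominator
  }

  // Hypothetical toy data: 1000 users, 10 of whom interacted with the item.
  val numUsers = 1000
  val numObserved = 10
  val alpha = 10.0
  val lambda = 0.01

  // All user factors fixed at 0.1 to keep the arithmetic easy to follow.
  val u: Vector[Double] = Vector.fill(numUsers)(0.1)
  // Binary preference: 1.0 for users who interacted with the item, else 0.0.
  val p: Vector[Double] = Vector.tabulate(numUsers)(i => if (i < numObserved) 1.0 else 0.0)
  // Assumed confidence weights: 1 + alpha for observed entries, 1 otherwise.
  val c: Vector[Double] = p.map(pi => 1.0 + alpha * pi)

  // X = m_j: lambda scaled by the number of observed ratings for the item.
  val vObservedScaling: Double = solveItemFactor(u, c, p, lambda, numObserved.toDouble)
  // X = m: lambda scaled by the total number of users in the sub-problem.
  val vTotalScaling: Double = solveItemFactor(u, c, p, lambda, numUsers.toDouble)

  def main(args: Array[String]): Unit = {
    println(s"X = m_j (observed ratings): v = $vObservedScaling")
    println(s"X = m   (all users):        v = $vTotalScaling")
  }
}
```

With the same lambda, the factor computed with X = m comes out smaller than the one computed with X = m_j, because the penalty grows with X while the data term stays the same. The toy numbers only show the direction of the effect; they are not meant to reproduce the magnitudes reported in this thread.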
