I use MAP (mean average precision) http://en.wikipedia.org/wiki/Information_retrieval#Mean_average_precision
. Either with random held out data or more usually a timebased split. The oldest 90% for
train the the most recent 10% for test.
We looked at the power of purchases and views for predicting purchases and as you'd expect
found purchases far far more predictive. Therefore I do the following for a crossrecommender:
B = matrix purchases userId x itemId, 1=purchase, no value for no purchase.
A = matrix views userId x itemId
H_p = history of purchase vectors for each user, users as columns, so H_p = B'
[B'B]H_p = R_p = recommendations made from user history. This, in my case is actually the
result of using the itembased mahout recommender trained on purchases and using the similarity
= loglikelihood (we tried other sims and this worked best). The [B'B] is actually row similarity
using loglikelihood. Since B'B is symmetric the row similarity works. For B'A, which is not
symmetric, row similarity doesn't workthere are two matrixes. To use loglikelihood sparsifier
I'm looking at modifying the matrix multiply code.
[B'A]H_p = R_c = crossrecommendations based on cooccurrences of view with purchases.
We also calculate the similarity matrix in both cases and do a linear combination of 4 recommendations.
It sounds like we're using a different training set than you and only using views with cooccurrences.
If you measure MAP on a recommender built with views where you are calculating how well is
predicts known held out purchases vs training on purchases you will almost surely find that
views are much less predictive and unless you have a huge amount more view data than purchases
they will loose the comparison.
I haven't implemented the learned weights yet but the overhead is not an issue. Once you learn
the weights you only have to check them once in a great while for a given data source. They
are constants at all recommendation time.
BTW this can be used for other recommendations by chaining actions that have something in
common (users IDs?) To make things easier to account for we use a unified user Id and item
Id space for the action data, though it is not strictly necessary.
Some related docs
Have a look at Sean's notes here: http://myrrix.com/examplemergingrecommenders/
Ted's preso here: http://www.slideshare.net/tdunning/searchasrecommendation
On Jun 1, 2013, at 12:19 PM, Dominik Hübner <contact@dhuebner.com> wrote:
Thanks for those detailed responses!
So I assume that the problem of scaling the initial recommendations can be implicitly solved
by learning the weights for the linear combination. The log rank Ted mentioned seems to be
useful as well!
What metric do you use to evaluate your recommendations? I currently setup a cooccurrence
matrix to predict based on purchases only, a Purchase X View matrix and a view only matrix.
I am trying to recommend products leading people to purchases. So I somehow need to find a
way, to distinct between types recommendations and prefer "purchase" recommendations. Is there
any common approach to evaluate recommendations in such settings? Maybe I just didn't get
what you meant with: "In my case I weight the entire row vector, add the two, resort, and
calculate precision @ some number of recs."
Using a percentile rank as shown in the paper for implicit ALS might be useful here. The problem
nevertheless is, that most users only have a single purchase.
Furthermore, do you really perform gradient descent learning of the weights using hadoop/mahout?
Isn't this too costly to perform due to the overheads of the JVM and hadoop?
On Jun 1, 2013, at 1:21 AM, Pat Ferrel <pat.ferrel@gmail.com> wrote:
> I've got a crossrecommender too. It was originally conceived to do a multiaction ensemble
from Ted's notes. I'm now gathering a new data set and building the metamodel learner.
>
> Even with the same scale you need to learn the weighting factors. Consider a simple ensemble
case:
>
> R_p is the matrix of all recommendations from an itembase recommender using history
for any single action (purchase?)
> S_p is the matrix of all similar items for a given item as measured by user actions (purchases?)
>
> It is highly likely that you can improve your ranking (also agree ratings are seldom
useful) by incorporating both sets of data. So R = R_p + aS_p This is the linear combo Ted
mentions. But what is "a"? The scale of the strengths here are very similar but their importance
is not. In my case I found that S_p was far more predictive of a future purchase than R_p
but without R_p every user sees the same recommendations for a given item. So some combo of
the two may prove far stronger in practice than either alone. The combo certainly has produced
a better precision score.
>
> Once you learn "a" things are much simpler but to learn it you need an objective function
like improving precision. Then you must find what weighting "a" gives the highest precision.
This requires virtually the entire of R_p and S_p. Since you don't know "a" you don't know
how much of the sorted R or S you need to combine. Using the Solr idea you would combine as
many results as you can get from the queries trying different weightings and measuring precision
each time. In my case I weight the entire row vector, add the two, resort, and calculate precision
@ some number of recs.
>
> Implementing a simple form of gradient ascent seems like a good approach because the
full ensemble is going to have more than just "a" and therefore need help converging across
multiple dimensions.
>
> The version of the crossrecommender I have considers item similarities as well as user
history based recommendations we have an ensemble something like:
>
> R_p + aS_p + bR_v + cS_v = R
> here R_v and S_v are from the cooccurrences of views with purchases, not just views.
>
> Ted's idea below about using the rank or log rank instead of whatever the similarity
metric returns as the "strength" seems like a really good one. It evens out the scales and
so may converge quicker if nothing else. Pretty easy to implement too.
>
>
> Repetition alert! Does anyone have a data set to try this on?
>
>
>
> On May 31, 2013, at 1:15 PM, Koobas <koobas@gmail.com> wrote:
>
> Ted,
> Thank you very much. This is very insightful.
> The log scaling is definitely an intuitive way of building the meta model.
> Not much disagreement about the uselessness of predicting ratings.
>
>
> On Fri, May 31, 2013 at 4:00 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
>
>> In my case, I put all the indicators from all different sources in the same
>> Solr/Lucene index. Recommendations consists of making a single query to
>> Solr/Lucene with as much data as I have or want to include.
>>
>> At the point that this query is done, there are no weights on the
>> indicators ... merely presence or absence in a field or query. The weights
>> that I typically use are computed on the fly by Lucene's default similarity
>> score and the results tend to be very good. There is no issue of combining
>> scores on different scales since there is only one composite score.
>>
>> If you *really* want to build multiple models using different technologies
>> and combine them, you need a socalled metamodel. There are many ways to
>> build such a beast. A very simple way is to reduce all scores to quantiles
>> then to a logodds scale (taking care not to ever estimate a quantile as
>> either 0 or 1). A linear combination of these rescaled scores can work
>> pretty well although you do have to learn the linear weights.
>>
>> Sometimes scores vary strongly from query to query. In such cases,
>> reducing a score to being some kind of rank statistic can be helpful. For
>> instance, you may want to have a score that is the log of the rank that an
>> item appears at in the results list. You might also be able to normalize
>> scores based on properties of the query. Such rankbased or normalized
>> scores can then often be combined by any metamodel, including the one I
>> mentioned above.
>>
>> You should also look at the netflix papers, especially the one describing
>> the winning entry for more ideas on model combination. The major
>> difference there is that they were trying to predict a rating which is a
>> task that I find essentially useless since ranking is so much more
>> important in most realworld applications. Others may dispute my
>> assessment on this, of course.
>>
>> There are many ways of building the metamodel that you need, but one
>> overriding thought that I have is that the deviations from ideal in all
>> real cases will be large enough that theory should not be taken too
>> literally here, but rather should be used as a weak, though still useful,
>> inspirational guide.
>>
>>
>> On Fri, May 31, 2013 at 3:18 PM, Koobas <koobas@gmail.com> wrote:
>>
>>> I am also very interested in the answer to this question.
>>> Just to reiterate, if you use different recommenders, e.g.,
>>> kNN userbased, kNN itembased, ALS, each one produces
>>> recommendations on a different scale. So how do you combine them?
>>>
>>>
>>> On Fri, May 31, 2013 at 3:07 PM, Dominik Hübner <contact@dhuebner.com
>>>> wrote:
>>>
>>>> Hey,
>>>> I have implemented a cross recommender based on the approach Ted
>> Dunning
>>>> proposed (cannot find the original post, but here is a follow up
>>>> http://www.mailarchive.com/user@mahout.apache.org/msg12983.html).
>>>> Currently I am struggling with the last step of blending the initial
>>>> recommendations.
>>>>
>>>> My current approach:
>>>> 1. Compute a cooccurrence matrix for each useful combination of
>>>> userproduct interaction (e.g. which product views and purchased do
>>> appear
>>>> in common …)
>>>> 2. Perform initial recommendation based on each matrix and the required
>>>> type of user vector (e.g. a user's history of views OR purchases) (like
>>> the
>>>> itembased recommender implemented in Mahout)
>>>>
>>>> In step 2, I adapted the AggregateAndRecommendReducer of Mahout, which
>>>> normalizes vectors while building the sum of weighted similarities or
>> in
>>>> this case => cooccurrences.
>>>>
>>>> Now I end up with multiple recommendations for each product, but all of
>>>> them are on a different scale.
>>>> How can I convert them to have the same scale, in order to be able to
>>>> weight them and build the linear combinations of initial
>> recommendations
>>> as
>>>> Ted proposed?
>>>> Would it make sense to normalize user vectors (before multiplying) as
>>> well?
>>>>
>>>> Otherwise views would have a much higher influence than purchases due
>> to
>>>> their plain characteristics (they just appear way more frequently). Or
>> is
>>>> this the reason for weighting purchases higher and views lower? If so,
>> I
>>>> think it's sort of inconvenient. Wouldn't it be much more favorable to
>>> get
>>>> each type of interaction within the same scale and use the weights just
>>> to
>>>> control each types influence on the final recommendation?
>>>>
>>>> Thanks in advance for any suggestions!
>>>>
>>>>
>>>>
>>>> Regards
>>>> Dominik
>>>>
>>>> Sent from my iPhone
>>>
>>
>
