I am using [B'A]H_v in my code as well.
Furthermore, the basics of my implementation are done so far (a more "general purpose" reimplementation of the item-based recommender) and I will soon move on to evaluating the approach with my dataset. On the one hand I'm quite lucky since each of the interactions has a timestamp; on the other hand the majority of users only have 5-30 views. My implementation supports a decay of the view count (using a factor like 1/(1+e^$TIME_PASSED)), which smooths truncation a bit.
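For concreteness, a minimal sketch of such a decay-weighted view count (the function names and the 30-day normalization scale are illustrative choices, not taken from my actual code):

```python
import math
import time

def decayed_view_count(view_timestamps, now=None, scale=86400.0 * 30):
    """Sum views, each weighted by a logistic decay 1/(1 + e^t),
    where t is the elapsed time normalized by `scale` (here ~30 days)."""
    now = now if now is not None else time.time()
    total = 0.0
    for ts in view_timestamps:
        t = (now - ts) / scale  # normalized time passed since the view
        total += 1.0 / (1.0 + math.exp(t))
    return total
```

A view made just now contributes 0.5; a view two months old contributes far less, so old activity fades out smoothly instead of being cut off abruptly.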
Nevertheless, I am still not sure whether precision is the right metric for evaluation. Since most users will only be presented with, or look at, the first few recommendations, the position or rank of the recommended items might be more important (see page 6 of http://www2.research.att.com/~yifanhu/PUB/cf.pdf). At least for optimising the weights of the linear combination of initial recommendations, using the rank intuitively seems more suitable to me.
Of course, optimising weights might become a bit tricky and will require some sort of grid
search for good parameters (which might be sped up by using an evolutionary algorithm picking
the best intermediate solutions).
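A sketch of what such a grid search could look like (names and grid values are illustrative; `score_fn` would evaluate the blended recommendations on held-out data, e.g. via MAP or mean rank):

```python
import itertools

def grid_search_weights(score_fn, grid=(0.0, 0.25, 0.5, 0.75, 1.0), n_weights=2):
    """Exhaustively try every combination of weights drawn from `grid`
    and return the combination that maximizes score_fn(weights)."""
    best_w, best_s = None, float("-inf")
    for w in itertools.product(grid, repeat=n_weights):
        s = score_fn(w)
        if s > best_s:
            best_w, best_s = w, s
    return best_w, best_s
```

An evolutionary variant would replace the exhaustive product with a population that keeps and mutates the best intermediate weight vectors.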
Since most of what I wrote above about evaluation is still in the planning stage, any suggestions are welcome!
On Jun 4, 2013, at 7:21 PM, Pat Ferrel <pat@occamsmachete.com> wrote:
> Err, I think there is a mistake below.
>
> You want to do [B'A]H_v, where H_v = A', the user's history of views as column vectors. At least that is what my code was doing.
>
> On another subject, the idea of truncating the user history vectors came up in another thread. In some research we did using the Mahout item-based recommender with retail action data, we found that training on up to a year of data gave the best precision score in recommendations.
>
> Truncation of the H_p or H_v column vectors does make sense if you want the most recent activity of the user to be reflected in the recommendations. That doesn't mean you want to truncate the data in A or B. Given the architecture of the current Mahout recommender this is a bit hard to tease out, since there isn't really any support for associating time with actions. I'd want to see how it affects precision and, of course, eventually A/B test it.
>
> You don't always have time associated with actions. In the data I'm mining from Pinterest, for example, the date one user followed another user is not available. So there is no reasonable way to do truncation. Maybe Pinterest could do better.
>
> On Jun 1, 2013, at 1:13 PM, Pat Ferrel <pat.ferrel@gmail.com> wrote:
>
> I use MAP (mean average precision) http://en.wikipedia.org/wiki/Information_retrieval#Mean_average_precision, either with randomly held-out data or, more usually, a time-based split: the oldest 90% for training, the most recent 10% for test.
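A minimal sketch of MAP@k over such a split (illustrative only, not the exact evaluation code; `recs_by_user` holds ranked recommendations trained on the oldest 90%, `heldout_by_user` the items each user touched in the most recent 10%):

```python
def average_precision(recommended, relevant, k=10):
    """AP@k: average of precision at each rank where a relevant item appears."""
    hits, score = 0, 0.0
    for rank, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            hits += 1
            score += hits / rank
    return score / min(len(relevant), k) if relevant else 0.0

def mean_average_precision(recs_by_user, heldout_by_user, k=10):
    """MAP@k: mean of the per-user average precisions."""
    aps = [average_precision(recs_by_user[u], heldout_by_user[u], k)
           for u in heldout_by_user if u in recs_by_user]
    return sum(aps) / len(aps) if aps else 0.0
```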
>
> We looked at the power of purchases and views for predicting purchases and, as you'd expect, found purchases far, far more predictive. Therefore I do the following for a cross-recommender:
>
> B = matrix of purchases, userId x itemId; 1 = purchase, no value for no purchase.
> A = matrix of views, userId x itemId.
> H_p = history of purchase vectors for each user, users as columns, so H_p = B'.
>
> [B'B]H_p = R_p = recommendations made from user history. This, in my case, is actually the result of using the item-based Mahout recommender trained on purchases with similarity = log-likelihood (we tried other sims and this worked best). The [B'B] is actually row similarity using log-likelihood. Since B'B is symmetric, the row similarity works. For B'A, which is not symmetric, row similarity doesn't work; there are two matrices. To use the log-likelihood sparsifier I'm looking at modifying the matrix-multiply code.
>
> [B'A]H_p = R_c = cross-recommendations based on co-occurrences of views with purchases.
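The two matrix products above can be sketched with small dense toy matrices (purely illustrative: the real computation is sparse, and the entries of [B'B] and [B'A] would be log-likelihood similarities rather than the raw co-occurrence counts used here):

```python
import numpy as np

# Toy data: 3 users x 4 items, 1 = action occurred.
B = np.array([[1, 0, 1, 0],   # purchases
              [0, 1, 0, 0],
              [1, 1, 0, 0]])
A = np.array([[1, 0, 1, 1],   # views
              [0, 1, 1, 0],
              [1, 1, 0, 1]])

H_p = B.T                     # purchase histories, users as columns (items x users)
R_p = (B.T @ B) @ H_p         # [B'B]H_p: item scores per user from purchase co-occurrence
R_c = (B.T @ A) @ H_p         # [B'A]H_p: cross-recommendations, views co-occurring with purchases
```

Each column of R_p and R_c is one user's score vector over the items; sorting a column descending yields that user's ranked recommendations.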
>
> We also calculate the similarity matrix in both cases and do a linear combination of the 4 recommendations.
>
> It sounds like we're using a different training set than you, and only using views with co-occurrences. If you measure MAP on a recommender built with views, where you are calculating how well it predicts known held-out purchases, vs. training on purchases, you will almost surely find that views are much less predictive and, unless you have a huge amount more view data than purchases, they will lose the comparison.
>
> I haven't implemented the learned weights yet, but the overhead is not an issue. Once you learn the weights you only have to check them once in a great while for a given data source. They are constants at recommendation time.
>
> BTW this can be used for other recommendations by chaining actions that have something in common (user IDs?). To make things easier to account for, we use a unified user ID and item ID space for the action data, though it is not strictly necessary.
>
> Some related docs:
> Have a look at Sean's notes here: http://myrrix.com/example-merging-recommenders/
> Ted's preso here: http://www.slideshare.net/tdunning/search-as-recommendation
>
> On Jun 1, 2013, at 12:19 PM, Dominik Hübner <contact@dhuebner.com> wrote:
>
> Thanks for those detailed responses!
> So I assume that the problem of scaling the initial recommendations can be implicitly solved by learning the weights for the linear combination. The log rank Ted mentioned seems to be useful as well!
>
> What metric do you use to evaluate your recommendations? I currently set up a co-occurrence matrix to predict based on purchases only, a purchase x view matrix, and a view-only matrix.
> I am trying to recommend products that lead people to purchases. So I somehow need to find a way to distinguish between types of recommendations and prefer "purchase" recommendations. Is there any common approach to evaluating recommendations in such settings? Maybe I just didn't get what you meant with: "In my case I weight the entire row vector, add the two, resort, and calculate precision @ some number of recs."
> Using a percentile rank as shown in the paper on implicit ALS might be useful here. The problem, nevertheless, is that most users only have a single purchase.
>
> Furthermore, do you really perform gradient-descent learning of the weights using Hadoop/Mahout? Isn't this too costly to perform due to the overhead of the JVM and Hadoop?
>
>
>
> On Jun 1, 2013, at 1:21 AM, Pat Ferrel <pat.ferrel@gmail.com> wrote:
>
>> I've got a cross-recommender too. It was originally conceived to do a multi-action ensemble from Ted's notes. I'm now gathering a new data set and building the meta-model learner.
>>
>> Even with the same scale you need to learn the weighting factors. Consider a simple ensemble case:
>>
>> R_p is the matrix of all recommendations from an item-based recommender using history for any single action (purchases?)
>> S_p is the matrix of all similar items for a given item as measured by user actions (purchases?)
>>
>> It is highly likely that you can improve your ranking (also agree ratings are seldom useful) by incorporating both sets of data, so R = R_p + aS_p. This is the linear combo Ted mentions. But what is "a"? The scales of the strengths here are very similar but their importance is not. In my case I found that S_p was far more predictive of a future purchase than R_p, but without R_p every user sees the same recommendations for a given item. So some combo of the two may prove far stronger in practice than either alone. The combo certainly has produced a better precision score.
>>
>> Once you learn "a" things are much simpler, but to learn it you need an objective function like improving precision. Then you must find what weighting "a" gives the highest precision. This requires virtually the entirety of R_p and S_p. Since you don't know "a", you don't know how much of the sorted R or S you need to combine. Using the Solr idea you would combine as many results as you can get from the queries, trying different weightings and measuring precision each time. In my case I weight the entire row vector, add the two, resort, and calculate precision @ some number of recs.
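That weighting step could be sketched for a single user like this (names are illustrative; it assumes r_p and s_p are score vectors over the same item index, and heldout is a set of held-out item indices):

```python
import numpy as np

def blend_and_evaluate(r_p, s_p, a, heldout, k=10):
    """Blend two score vectors as R = R_p + a * S_p, resort,
    and compute precision@k against a held-out item set."""
    blended = r_p + a * s_p
    top_k = np.argsort(-blended)[:k]   # indices of highest-scoring items first
    hits = sum(1 for item in top_k if item in heldout)
    return hits / k
```

Sweeping "a" and averaging this over all users gives the objective surface to climb.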
>>
>> Implementing a simple form of gradient ascent seems like a good approach, because the full ensemble is going to have more than just "a" and therefore needs help converging across multiple dimensions.
>>
>> The version of the cross-recommender I have considers item similarities as well as user-history-based recommendations, so we have an ensemble something like:
>>
>> R_p + aS_p + bR_v + cS_v = R
>> where R_v and S_v are from the co-occurrences of views with purchases, not just views.
>>
>> Ted's idea below about using the rank or log rank, instead of whatever the similarity metric returns as the "strength", seems like a really good one. It evens out the scales and so may converge quicker, if nothing else. Pretty easy to implement too.
>>
>>
>> Repetition alert! Does anyone have a data set to try this on?
>>
>>
>>
>> On May 31, 2013, at 1:15 PM, Koobas <koobas@gmail.com> wrote:
>>
>> Ted,
>> Thank you very much. This is very insightful.
>> The log scaling is definitely an intuitive way of building the meta model.
>> Not much disagreement about the uselessness of predicting ratings.
>>
>>
>> On Fri, May 31, 2013 at 4:00 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
>>
>>> In my case, I put all the indicators from all different sources in the same
>>> Solr/Lucene index. Recommendations consists of making a single query to
>>> Solr/Lucene with as much data as I have or want to include.
>>>
>>> At the point that this query is done, there are no weights on the
>>> indicators ... merely presence or absence in a field or query. The weights
>>> that I typically use are computed on the fly by Lucene's default similarity
>>> score and the results tend to be very good. There is no issue of combining
>>> scores on different scales since there is only one composite score.
>>>
>>> If you *really* want to build multiple models using different technologies
>>> and combine them, you need a so-called meta-model. There are many ways to
>>> build such a beast. A very simple way is to reduce all scores to quantiles,
>>> then to a log-odds scale (taking care not to ever estimate a quantile as
>>> either 0 or 1). A linear combination of these rescaled scores can work
>>> pretty well, although you do have to learn the linear weights.
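A minimal sketch of that rescaling (illustrative names; the clipping keeps quantiles strictly inside (0, 1), as Ted cautions):

```python
import math

def to_quantile(score, reference_scores):
    """Empirical quantile of `score` within a reference sample of scores
    from the same recommender; the n+1 denominator avoids hitting 1.0."""
    below = sum(1 for s in reference_scores if s <= score)
    return below / (len(reference_scores) + 1)

def to_log_odds(q, eps=1e-6):
    """Map a quantile in (0, 1) to the log-odds scale."""
    q = min(max(q, eps), 1.0 - eps)   # never estimate exactly 0 or 1
    return math.log(q / (1.0 - q))
```

Scores from different recommenders, once pushed through these two maps, land on a common scale where a learned linear combination makes sense.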
>>>
>>> Sometimes scores vary strongly from query to query. In such cases,
>>> reducing a score to some kind of rank statistic can be helpful. For
>>> instance, you may want to have a score that is the log of the rank at
>>> which an item appears in the results list. You might also be able to
>>> normalize scores based on properties of the query. Such rank-based or
>>> normalized scores can then often be combined by any meta-model, including
>>> the one I mentioned above.
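One way to realize the log-rank idea (an illustrative sketch; negated so that higher still means better):

```python
import math

def log_rank_scores(ranked_items):
    """Replace raw similarity strengths with -log(rank), so the top-ranked
    item scores 0, lower ranks score progressively more negative, and the
    scale is comparable across recommenders regardless of their raw scores."""
    return {item: -math.log(rank)
            for rank, item in enumerate(ranked_items, start=1)}
```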
>>>
>>> You should also look at the Netflix papers, especially the one describing
>>> the winning entry, for more ideas on model combination. The major
>>> difference there is that they were trying to predict a rating, which is a
>>> task that I find essentially useless since ranking is so much more
>>> important in most real-world applications. Others may dispute my
>>> assessment on this, of course.
>>>
>>> There are many ways of building the meta-model that you need, but one
>>> overriding thought that I have is that the deviations from ideal in all
>>> real cases will be large enough that theory should not be taken too
>>> literally here, but rather should be used as a weak, though still useful,
>>> inspirational guide.
>>>
>>>
>>> On Fri, May 31, 2013 at 3:18 PM, Koobas <koobas@gmail.com> wrote:
>>>
>>>> I am also very interested in the answer to this question.
>>>> Just to reiterate: if you use different recommenders, e.g.,
>>>> kNN user-based, kNN item-based, ALS, each one produces
>>>> recommendations on a different scale. So how do you combine them?
>>>>
>>>>
>>>> On Fri, May 31, 2013 at 3:07 PM, Dominik Hübner <contact@dhuebner.com> wrote:
>>>>
>>>>> Hey,
>>>>> I have implemented a cross-recommender based on the approach Ted Dunning proposed (cannot find the original post, but here is a follow-up: http://www.mail-archive.com/user@mahout.apache.org/msg12983.html).
>>>>> Currently I am struggling with the last step of blending the initial recommendations.
>>>>>
>>>>> My current approach:
>>>>> 1. Compute a co-occurrence matrix for each useful combination of user-product interactions (e.g. which product views and purchases appear in common …)
>>>>> 2. Perform an initial recommendation based on each matrix and the required type of user vector (e.g. a user's history of views OR purchases), like the item-based recommender implemented in Mahout.
>>>>>
>>>>> In step 2, I adapted the AggregateAndRecommendReducer of Mahout, which normalizes vectors while building the sum of weighted similarities, or in this case => co-occurrences.
>>>>>
>>>>> Now I end up with multiple recommendations for each product, but all of them are on a different scale.
>>>>> How can I convert them to have the same scale, in order to be able to weight them and build the linear combinations of initial recommendations as Ted proposed?
>>>>> Would it make sense to normalize user vectors (before multiplying) as well?
>>>>>
>>>>> Otherwise views would have a much higher influence than purchases due to their plain characteristics (they just appear way more frequently). Or is this the reason for weighting purchases higher and views lower? If so, I think it's sort of inconvenient. Wouldn't it be much more favorable to get each type of interaction on the same scale and use the weights just to control each type's influence on the final recommendation?
>>>>>
>>>>> Thanks in advance for any suggestions!
>>>>>
>>>>>
>>>>>
>>>>> Regards
>>>>> Dominik
>>>>>
>>>>> Sent from my iPhone
>>>>
>>>
>>
>
>
>
