mahout-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Yanir Seroussi <yanir.serou...@gmail.com>
Subject Re: A question regarding GenericUserBasedRecommender
Date Tue, 10 Aug 2010 06:40:11 GMT
On Tue, Aug 10, 2010 at 15:54, Sean Owen <srowen@gmail.com> wrote:

> On Mon, Aug 9, 2010 at 6:46 PM, Yanir Seroussi <yanir.seroussi@gmail.com>
> wrote:
> > As I see it, there are two separate tasks here. The first one is the
> > recommendation task, where it makes sense to take the N most similar
> users
> > and generate recommendations based on their preferences. The second one
> is
> > the rating prediction task (or preference estimation in Mahout terms),
> where
> > given a target user and a target item the recommender generates a
> prediction
> > of the rating for the item. In the way
>
> They are related tasks. Estimating preferences is *a* way to achieve
> the further goal of recommending items. It is how just about all
> algorithms work, so I'd say that in practice estimating preference is
> a sub-task of recommendation, and really the core of it.
>
>
Agreed.


> > GenericUserBasedRecommendation.estimatePreference() is implemented now,
> it
> > can only generate predictions for very few items unless the neighbourhood
> > size is very large.
>
> I don't think this is necessarily true. Sparse data can mean users are
> hard to connect to other users, or that those connections yield few
> candidates for recommendations, sure. A large neighborhood doesn't
> help this though. If there isn't 1 "nearby" user, there aren't 100
> either.
>
> But, crudely, I think your data has to be "really really sparse"
> before you frequently can't recommend anything at all.
>
> I don't believe it's a question of implementation. This is simply what
> user-based recommendation is.
>
>
As it is now, GenericUserBasedRecommendation.estimatePreference() can only
estimate preferences of items that were rated by users in the static
neighbourhood. Thus, a larger neighbourhood is likely to contain more
neighbours that rated unknown items. For example, if we take only one
neighbour, we can only estimate preferences of items that this neighbour
rated.


>
>
> > To make this concrete, I'll give an example. Suppose we choose N=20, and
> we
> > want to predict the rating of the target user for the movie The
> Godfather.
> > This movie was rated by many users in the dataset, but unfortunately it
> > wasn't rated by the 20 users that are most similar to the target user.
> Thus,
> > estimatePreference() will return NaN as the prediction.
>
> That's right. There would be no basis for estimating a preference.
>
>
> > Now suppose that these top 20 users have a similarity of 1 to the target
> > user, and that the next 20 users have a similarity of 0.99 to the target
> > user, and they have all rated The Godfather. It would make sense to
> generate
> > a rating prediction based on the next 20 users, and this prediction is
> > actually likely to be pretty good.
>
> True, and if this were the case, I'd suggest that you want to define
> the neighborhood by a similarity threshold, not by size. Let everyone
> with similarity >= 0.9 or something be in the neighborhood.
>
>
> > As I mentioned in my original message, it is implied from Herlocker et al
> > (1999) that the number of neighbours shouldn't affect coverage. Another
> > widely cited paper, by Adomavicius and Tuzhilin (2005, see
> >
> http://ids.csom.umn.edu/faculty/gedas/papers/recommender-systems-survey-2005.pdf
> ),
> > actually states that explicitly: "That is, the
> > value of the unknown rating r(c, s) for user c and item s is usually
> > computed as an aggregate of the ratings of some other (usually the N most
> > similar) users for the same item s: r(c, s) = aggr(c' in C^, r(c', s))
> where
> > C^ denotes the set of N users that are the most similar to user c and who
> > have rated item s (N can range anywhere from 1 to the number of all
> users)."
> >
> > While Sean's approach is probably also okay, especially for the
> > recommendation scenario, I think it's worthwhile documenting the fact
> that
> > this behaviour is somewhat inconsistent with the literature.
>
> As (I think) you say, it's consistent with Herlocker -- this is just
> plain-vanilla user-based recommendation.
>
> The other approach you cite makes sense but gets a lot more expensive
> to compute. You need a new neighborhood for every candidate item. It
> is also, in my world view, not the plain-vanilla approach that has
> become called "user-based recommendation". So, no I do not find this
> inconsistent.
>

No, I'm not saying that it's consistent with Herlocker et al. They say in
section 6 that "The second technique is to pick the best n correlates for a
given n. This technique performs reasonably well, as it does not limit
prediction coverage". The fact that no matter what n you choose, your
coverage remains the same, means that the n users are dependent on the
target item. Also, in section 2 of the paper, they say that the second step
in generating a prediction is "select a subset of users to use as a set of
predictors (possibly for a specific item)". The other neighbourhood
selection technique they experimented with was setting similarity threshold,
which did affect coverage, as it is not item-specific. Putting all of this
together strongly suggests that the nearest-n approach they were referring
to is the item-specific one.

In addition, the paper I cited is a survey of recommender systems, and was
cited more than 1000 times (according to Google Scholar), which means that
probably quite a few people actually see the non-static neighbourhood as
plain-vanilla user-based recommendation.

I agree that the non-static approach is more expensive to compute in Mahout.
But if we disregard the way it is implemented at the moment, it doesn't
necessarily have to be expensive, because we have to compute the
similarities of the target user to all other users in any case. So if we
were to keep a sorted list of these similarities, generating a prediction
for an item would only consist of taking the top-n users in the list that
rated the item. However, it would probably require big changes in Mahout, so
it's probably better to leave it as-is and just let users know of this
issue.

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message