As a small followup on this, here's a result that should hold:
Setting the sampling rate to, say, 1/X (i.e. if you set it to 20%,
X=5) should, of course, reduce the time spent finding a neighborhood
by a factor of X. Assuming users are pretty evenly scattered
around your ratingspace, the average distance to users in your
computed neighborhood also increases by a factor of X.
So you get results X times faster, but the results you get are X times
'worse'. This sounds bad, but consider that users 5 times farther away
in your ratingspace may still be suitable neighbors and yield the
same recommendations.
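
For what it's worth, here is a quick way to sanity-check that in
Python. It is only a rough sketch, not code from any library: it
assumes users are points scattered uniformly over a one-dimensional
ratingspace (the simplest case, and the one where the factor-of-X
relationship is cleanest), and the names compare and
mean_neighbor_distance, along with the default sizes, are made up for
illustration.

```python
import bisect
import random


def mean_neighbor_distance(sorted_users, target, k):
    """Mean distance from target to its k nearest users (1-D positions)."""
    i = bisect.bisect_left(sorted_users, target)
    # The k nearest users must lie within k positions on either side.
    window = sorted_users[max(0, i - k): i + k]
    distances = sorted(abs(u - target) for u in window)
    return sum(distances[:k]) / k


def compare(num_users=20000, x=5, k=10, num_targets=200, seed=42):
    """Return (work_ratio, distance_ratio) for a sampling rate of 1/x.

    Assumption: users are uniform on [0, 1]; targets are kept away
    from the edges so neighborhoods are not cut off.
    """
    rng = random.Random(seed)
    users = [rng.random() for _ in range(num_users)]
    sampled = rng.sample(users, num_users // x)  # keep 1/x of the users
    targets = [rng.uniform(0.25, 0.75) for _ in range(num_targets)]

    full_sorted = sorted(users)
    sampled_sorted = sorted(sampled)
    full = sum(mean_neighbor_distance(full_sorted, t, k)
               for t in targets) / num_targets
    samp = sum(mean_neighbor_distance(sampled_sorted, t, k)
               for t in targets) / num_targets

    # Candidates examined shrink by x; neighbor distance grows.
    return len(users) / len(sampled), samp / full


if __name__ == "__main__":
    work_ratio, distance_ratio = compare()
    print("candidates examined shrink by a factor of", work_ratio)
    print("mean neighbor distance grows by a factor of", distance_ratio)
```

With X=5 the first number is exactly 5 by construction, and the second
should come out in the neighborhood of 5 as well, give or take
sampling noise. In a ratingspace with more dimensions I would expect
the distance to grow more slowly than X, so treat the one-dimensional
figure as the pessimistic case.
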
On Fri, May 1, 2009 at 8:32 AM, Sean Owen <srowen@gmail.com> wrote:
> It really depends on the nature of the data and what tradeoff you want
> to make. I have not studied this in detail. Anecdotally, on a
> largeish data set you can ignore most users and still end up with an
> OK neighborhood.
>
> Actually I should do a bit of math to get an analytical result on
> this, let me do that.
