mahout-user mailing list archives

From: Sean Owen <sro...@gmail.com>
Subject: Re: Recommendations from flat data
Date: Fri, 01 May 2009 07:32:16 GMT
On Fri, May 1, 2009 at 5:22 AM, Otis Gospodnetic
<otis_gospodnetic@yahoo.com> wrote:
> Some feedback from my Taste experience.  Tanimoto was the bottleneck for me, too.
> I used the highly sophisticated kill -QUIT pid method to determine that. Such kills
> always caught Taste in the Tanimoto part of the code.

Yeah, er, the correlation is certainly consuming most of the time in
this scenario. Tanimoto should now be no slow*er* than the cosine
measure though.

By default the user neighborhood component searches among all users
for the closest neighbors. That's a lot of similarities to compute,
which is why it might be better to just draw from a sample of all users.
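
For concreteness, here's roughly what that looks like in code. This is
a sketch from memory rather than something I've just run: the file name
and constants are made up, and the five-argument
NearestNUserNeighborhood constructor with a sampling rate is how I
recall the API, so check it against your version.

import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.TanimotoCoefficientSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class SampledNeighborhoodSketch {
  public static void main(String[] args) throws Exception {
    DataModel model = new FileDataModel(new File("ratings.csv"));
    UserSimilarity similarity = new TanimotoCoefficientSimilarity(model);
    // Last argument is the sampling rate: only ~10% of all users are
    // even considered when building the top-25 neighborhood.
    UserNeighborhood neighborhood = new NearestNUserNeighborhood(
        25, Double.NEGATIVE_INFINITY, similarity, model, 0.1);
    Recommender recommender =
        new GenericUserBasedRecommender(model, neighborhood, similarity);
    List<RecommendedItem> recs = recommender.recommend(123L, 10);
    System.out.println(recs);
  }
}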

> Do you know, roughly, what that nontrivial amount might be? e.g. 10% or more?

It really depends on the nature of the data and what tradeoff you want
to make. I have not studied this in detail. Anecdotally, on a
large-ish data set you can ignore most users and still end up with an
OK neighborhood.

Actually, I should do a bit of math to get an analytical result on
this; let me do that.
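
The shape of it would be something like this (back-of-envelope only,
treating each user's inclusion in a uniform sample at rate p as
independent): if the true neighborhood is the top n users, the number
that survive sampling is

    K ~ Binomial(n, p),    E[K] = p*n

and the chance of keeping at least m of the true top n is

    P(K >= m) = \sum_{k=m}^{n} \binom{n}{k} p^k (1-p)^{n-k}

With n = 50 and p = 0.1 you'd expect about 5 of the real top-50
neighbors to survive, with the rest of the neighborhood filled by the
next-best sampled users, which is why quality tends to degrade
gracefully rather than fall off a cliff.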


> Also, does the "nearly instantaneous" refer to calling Taste with a single recommend
> request at a time?  I'm asking because I recently did some heavy-duty benchmarking and
> things were definitely not instantaneous when I increased the number of concurrent
> requests.  To make things fast (e.g. under 100 ms avg.) and run in a reasonable amount
> of memory, I had to resort to
> remove-noise-users-and-items-from-input-and-then-read-the-data-model... which means
> users who look like noise to the system (and that's a lot of them, in order to keep
> things fast and limit memory usage) will not get recommendations.

I suppose I just meant fast compared to loading the entire DataModel.
The recommendation itself should have taken on the order of hundreds of
milliseconds, compared to a good 30 seconds for the load.
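
For reference, the load I'm talking about is just this (file name made
up; FileDataModel may defer the real work until first use, so ask it
for something to force the load):

import java.io.File;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.model.DataModel;

public class LoadTimingSketch {
  public static void main(String[] args) throws Exception {
    long start = System.currentTimeMillis();
    DataModel model = new FileDataModel(new File("ratings.csv"));
    // Force the (possibly lazy) load by actually reading something.
    int numUsers = model.getNumUsers();
    System.out.println("Loaded " + numUsers + " users in "
        + (System.currentTimeMillis() - start) + "ms");
  }
}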

One recent benchmark I can offer: on a chunky machine (8 cores
@ 2GHz or so, 20GB RAM), using the 10M-rating data set from GroupLens
and slope-one, recommendations are produced in about 400ms each. Not
terrible, but slow for real-time usage. Precomputing in some way seems
ideal.
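
That benchmark is essentially the following setup (a sketch; the file
name is made up, and FileDataModel wants comma- or tab-separated lines,
so the "::" delimiters in the GroupLens ratings.dat need converting
first). SlopeOneRecommender builds its item-item average diffs up front
when constructed; the ~400ms is the per-request work that remains:

import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.recommender.slopeone.SlopeOneRecommender;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;

public class SlopeOneSketch {
  public static void main(String[] args) throws Exception {
    DataModel model = new FileDataModel(new File("ratings-10M.csv"));
    // Item-item average rating diffs are computed once, here.
    Recommender recommender = new SlopeOneRecommender(model);
    long start = System.currentTimeMillis();
    List<RecommendedItem> recs = recommender.recommend(42L, 10);
    System.out.println(recs + " ("
        + (System.currentTimeMillis() - start) + "ms)");
  }
}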

Locally on my desktop (2 core @ 2.5GHz, 1GB heap) this sample code is
producing recommendations in about 550ms. If you go to 10% sampling,
that drops to 350ms or so.

Concurrency: response time should be reasonably constant as long as
the number of concurrent requests is <= the number of cores. One
factor that can slow things down is the caches in the code, which
involve a bit of synchronization; I have found that to be a minor
bottleneck. Obviously, once you scale beyond the number of cores,
response time increases linearly with the number of concurrent
requests.
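
If you want to reproduce that behavior, the harness I have in mind is
something like this (hypothetical, not from the codebase; plug in
whatever Recommender and user IDs you're testing):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

import org.apache.mahout.cf.taste.recommender.Recommender;

public class ConcurrencySketch {

  // Fires one recommend() per user ID across `threads` workers and
  // reports the mean per-request latency.
  static void benchmark(final Recommender recommender,
                        int threads,
                        long[] userIDs) throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    List<Callable<Long>> tasks = new ArrayList<Callable<Long>>();
    for (final long userID : userIDs) {
      tasks.add(new Callable<Long>() {
        public Long call() throws Exception {
          long start = System.currentTimeMillis();
          recommender.recommend(userID, 10);
          return System.currentTimeMillis() - start;
        }
      });
    }
    long total = 0L;
    // invokeAll blocks until every task has finished.
    for (Future<Long> f : pool.invokeAll(tasks)) {
      total += f.get();
    }
    pool.shutdown();
    System.out.println("mean latency " + (total / userIDs.length)
        + "ms at concurrency " + threads);
  }
}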
