mahout-user mailing list archives

From Otis Gospodnetic <>
Subject Re: Recommendations from flat data
Date Fri, 01 May 2009 04:22:14 GMT


Some feedback from my Taste experience.  Tanimoto was the bottleneck for me, too.  I used
the highly sophisticated kill -QUIT pid method to determine that: such thread dumps always
caught Taste in the Tanimoto part of the code.

Do you know, roughly, what that nontrivial amount might be? e.g. 10% or more?

Also, does the "nearly instantaneous" refer to calling Taste with a single recommend request
at a time?  I'm asking because I recently did some heavy-duty benchmarking, and things were
definitely not instantaneous when I increased the number of concurrent requests.  To make
things fast (e.g. under 100 ms avg.) and run in a reasonable amount of memory, I had to resort
to remove-noise-users-and-items-from-input-and-then-read-the-data-model.... which means users
who look like noise to the system (and that's a lot of them in order to keep things fast and
limit memory usage) will not get recommendations.
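
The preprocessing described above, dropping "noise" users before the data model is ever built, can be sketched roughly like this. The method name, the map-of-item-ids shape, and the minPrefs threshold are illustrative assumptions for this sketch, not part of Taste's API:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class NoiseFilterSketch {

    // Keep only users whose preference count meets a minimum threshold.
    // Users below it ("noise" users) are dropped before the data model is
    // built, trading away their recommendations for speed and memory.
    // minPrefs is an illustrative knob, not a Taste parameter.
    static Map<Long, List<Long>> dropNoiseUsers(Map<Long, List<Long>> itemIdsByUser,
                                                int minPrefs) {
        Map<Long, List<Long>> kept = new HashMap<>();
        for (Map.Entry<Long, List<Long>> e : itemIdsByUser.entrySet()) {
            if (e.getValue().size() >= minPrefs) {
                kept.put(e.getKey(), e.getValue());
            }
        }
        return kept;
    }
}
```

The same pass could equally drop rarely seen items; either way the filtered map is what would then be written out and read into the data model.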

Sematext -- -- Lucene - Solr - Nutch

----- Original Message ----
> From: Sean Owen <>
> To:
> Sent: Thursday, April 30, 2009 7:18:28 PM
> Subject: Re: Recommendations from flat data
> After digging in this evening I have some answers I think.
> First, can you use the very latest code from Subversion? Because the
> DataModel you use has actually been removed and rolled into FileDataModel.
> This is also because I checked in a change tonight that should cut down peak
> memory usage while constructing a FileDataModel by a nontrivial amount.
> I was able to run recommendations over 10M data points in 768M of memory
> tonight.
> It does take some time to parse and build the model. After that the
> recommendation is nearly instantaneous with any similarity metric. Are you
> sure Tanimoto was taking a longer time - meaning did you test over a lot of
> recommendations?
> Either way there are certainly some params you can tweak to trade a bit of
> accuracy (maybe) for speed. Look at the sampling rate param on the user
> neighborhood implementation. Set it to like 10% and it should get much
> faster - of course this doesn't change startup overhead though.
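
The sampling-rate trade-off Sean describes amounts to considering only a random fraction of candidate users when forming a neighborhood. A minimal sketch of the idea, with a hypothetical helper rather than Taste's actual constructor parameter:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class SamplingSketch {

    // Consider each candidate user with probability `rate` (0.1 for "10%").
    // This shrinks the neighborhood computation roughly in proportion to the
    // rate, at some possible cost in accuracy; as noted above, it does
    // nothing for the startup cost of parsing and building the model.
    static List<Long> sampleCandidates(List<Long> allUserIds, double rate, Random rng) {
        List<Long> sampled = new ArrayList<>();
        for (long id : allUserIds) {
            if (rng.nextDouble() < rate) {
                sampled.add(id);
            }
        }
        return sampled;
    }
}
```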
> On Apr 30, 2009 7:52 PM, "Sean Owen" wrote:
> Hm, something is off indeed. Tanimoto should be notably faster than a
> cosine measure correlation -- it's doing a simple, optimized set
> intersection and union rather than iterating over a bunch of
> preference values. While 5M data points is going to consume a
> reasonable amount of memory, I would not guess it would exhaust a 1GB
> heap -- should be in the hundreds of megs.
> If you can run only the recommender in the JVM, obviously that frees
> up memory. I would probably remove the caching wrapper too if memory
> is at a premium, but that's not your problem. If you are running on a
> 64-bit machine in 64-bit mode, try 32-bit mode (-d32) to reduce the
> object overhead in the JVM.
> From there, you could load the data in a DB instead and use a
> JDBC-based DataModel, since that doesn't load in memory. You could
> also try adapting my NetflixDataModel which reads from data organized
> in directories on disk.
> But no, something just doesn't seem right; your current setup should be
> OK.  I think I need to try to replicate this with a similarly sized data
> set and see what's up.
> On Thu, Apr 30, 2009 at 5:48 PM, Paul Loy wrote:
> > Hi Sean,
> >
> > that worked f...
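
To make the set-intersection point above concrete: the Tanimoto coefficient over two users' item sets is just |A ∩ B| / |A ∪ B|, so it never touches preference values at all. A self-contained sketch using plain java.util sets (Taste's own implementation uses optimized primitive-long sets):

```java
import java.util.HashSet;
import java.util.Set;

public class TanimotoSketch {

    // Tanimoto (Jaccard) coefficient over two users' item-id sets:
    // |A ∩ B| / |A ∪ B|. A pure set operation: no preference values are
    // read, which is why it should be cheap relative to value-based
    // similarities like cosine or Pearson.
    static double tanimoto(Set<Long> a, Set<Long> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 0.0; // convention for two empty sets
        }
        Set<Long> intersection = new HashSet<>(a); // copy, so inputs stay untouched
        intersection.retainAll(b);
        int inter = intersection.size();
        int union = a.size() + b.size() - inter;
        return (double) inter / union;
    }
}
```

For users with items {1, 2, 3, 4} and {3, 4, 5}, the intersection has 2 items and the union 5, giving 0.4.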
