mahout-user mailing list archives

From Manuel Blechschmidt <>
Subject Re: Mahout performance issues
Date Thu, 01 Dec 2011 08:52:10 GMT

On 01.12.2011, at 09:37, Sebastian Schelter wrote:

> Daniel, can you plot two curves showing the distribution of
> interactions per user and the distribution of interactions per item? I
> think we need to get a better picture of your data first.
> Generally I always recommend using precomputed similarities. You can
> still serve new users with real-time recommendations; the only
> disadvantages are the higher complexity and a delayed inclusion of new
> items.
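
As a rough illustration of the two distributions asked for above, here is a minimal Java sketch that counts interactions per user and per item and then tallies how many users (or items) share each count. The class name, the method names and the tiny in-memory data set are made up for this example; real data would be read from the preference file instead.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

public class InteractionHistogram {

    // Counts interactions per key. Each row is {userId, itemId};
    // column 0 groups by user, column 1 groups by item.
    static Map<Long, Integer> countPerKey(long[][] interactions, int column) {
        Map<Long, Integer> counts = new HashMap<>();
        for (long[] row : interactions) {
            counts.merge(row[column], 1, Integer::sum);
        }
        return counts;
    }

    // Maps "number of interactions" to "number of keys with that many".
    // Plotting this map gives the curve for users or for items.
    static SortedMap<Integer, Integer> distribution(Map<Long, Integer> counts) {
        SortedMap<Integer, Integer> dist = new TreeMap<>();
        for (int c : counts.values()) {
            dist.merge(c, 1, Integer::sum);
        }
        return dist;
    }

    public static void main(String[] args) {
        // Hypothetical sample data: {userId, itemId} pairs.
        long[][] data = {
            {1, 10}, {1, 11}, {1, 12}, {2, 10}, {3, 10}, {4, 10}
        };
        System.out.println("per user: " + distribution(countPerKey(data, 0)));
        System.out.println("per item: " + distribution(countPerKey(data, 1)));
    }
}
```

A long tail on the user side (a few users with very many interactions) would match the 'heavy users' mentioned further down in the thread.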

In the paper
"Fast Online Learning through Offline Initialization for Time-sensitive Recommendation",
Deepak Agarwal et al. describe how to include new items quickly in the recommendations.
This is used for personalizing the news stories on the Yahoo start page.

@Daniel: I would also recommend profiling your application with JVisualVM:

After I did this with my recommender, I found that the default cache size for item similarities
was far too low. The details are described in this ticket:
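
To show why an undersized similarity cache hurts, here is a small self-contained Java sketch. It is not taken from the ticket and it is not Mahout's CachingItemSimilarity; the Similarity interface, the CachedSimilarity class and the toy similarity function are assumptions made up for this example. It wraps a similarity function in a bounded LRU cache (built on java.util.LinkedHashMap) and counts how often the underlying function has to be recomputed.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SimilarityCacheSketch {

    // Hypothetical stand-in for an item-item similarity function.
    interface Similarity {
        double between(long a, long b);
    }

    // Wraps a Similarity in a bounded LRU cache and counts cache misses.
    static final class CachedSimilarity implements Similarity {
        private final Similarity delegate;
        private final Map<String, Double> cache;
        int misses = 0;

        CachedSimilarity(Similarity delegate, final int maxEntries) {
            this.delegate = delegate;
            // accessOrder = true makes LinkedHashMap evict the least
            // recently used entry first.
            this.cache = new LinkedHashMap<String, Double>(16, 0.75f, true) {
                @Override
                protected boolean removeEldestEntry(Map.Entry<String, Double> eldest) {
                    return size() > maxEntries;
                }
            };
        }

        @Override
        public double between(long a, long b) {
            // Similarity is symmetric here, so order the pair in the key.
            String key = Math.min(a, b) + ":" + Math.max(a, b);
            Double hit = cache.get(key);
            if (hit != null) {
                return hit;
            }
            misses++;
            double value = delegate.between(a, b);
            cache.put(key, value);
            return value;
        }
    }

    // Looks up every item pair 'passes' times and returns the miss count.
    static int missesFor(int maxEntries, int items, int passes) {
        CachedSimilarity sim = new CachedSimilarity(
                (a, b) -> 1.0 / (1 + Math.abs(a - b)), maxEntries);
        for (int p = 0; p < passes; p++) {
            for (int i = 0; i < items; i++) {
                for (int j = i + 1; j < items; j++) {
                    sim.between(i, j);
                }
            }
        }
        return sim.misses;
    }

    public static void main(String[] args) {
        // 10 items give 45 distinct pairs; each pair is requested 3 times.
        System.out.println("cache of 100 entries: " + missesFor(100, 10, 3) + " misses");
        System.out.println("cache of 10 entries:  " + missesFor(10, 10, 3) + " misses");
    }
}
```

With the cache larger than the 45 distinct pairs, only the first pass misses (45 computations). With only 10 entries, every one of the 135 lookups is recomputed, because each pair is evicted before it is requested again. A profiler shows the same effect as time spent in the similarity computation rather than in the cache.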

> --sebastian


> 2011/11/30 Sean Owen <>:
>> The simple answer is that:
>> Mahout absorbed a non-distributed recommender project called Taste, which
>> scales up to a point which may be sufficient for a lot of users. It
>> certainly is a lot simpler. Yes it is realistic to do near-real-time
>> recommendations, though it gets harder and harder and requires more tuning,
>> tradeoffs and optimization as this thread shows.
>> The rest, written from scratch, is almost all distributed and Hadoop-based,
>> including distributed re-implementations of the same algorithms.
>> On Wed, Nov 30, 2011 at 8:23 PM, Dan Beaulieu <> wrote:
>>> Hi all, this is a tangent and can mostly be ignored by the people
>>> interested in this problem.
>>> I'm new to Machine Learning and especially Mahout. Following this
>>> discussion has made me a bit confused.
>>> Isn't Mahout used for large datasets where it makes sense to distribute the
>>> work? Why then isn't anyone pointing
>>> out that the problem may be the use of one single Mahout node? Is it
>>> because it's boolean based? Is it because the data set
>>> isn't really that large?
>>> Even if for whatever reason a single node will do for this case, is it
>>> really expected that the recommendation process would finish in less than
>>> half a second?
>>> This makes me think if that is the expectation then the data set is
>>> actually small and Mahout might be overkill...
>>> What obvious piece of the Mahout puzzle am I missing?
>>> Thanks.
>>> Dan
>>> On Wed, Nov 30, 2011 at 11:56 AM, Sean Owen <> wrote:
>>>> Have you used CachingItemSimilarity? That will hold common similarities
>>> in
>>>> memory. It's a lot easier than pre-computing and might help.
>>>> I think something like your change is a good one (Sebastian what do you
>>>> think) in that it gives you the ultimate lever to control how many
>>>> candidates are evaluated. That ought to make it go as fast as you like,
>>> but
>>>> it trades off quality. Still I'd be really surprised if there's no viable
>>>> middle ground -- this works fine at smaller scale, where 100s of
>>> candidates
>>>> are evaluated, perhaps, and you can use your lever to get to 100s of
>>>> candidates at your scale too. Is that still both slow and inaccurate?
>>>> On Wed, Nov 30, 2011 at 3:18 PM, Daniel Zohar <>
>>> wrote:
>>>>> I just tested the app with Mahout 0.6.
>>>>> There seems to be a small performance improvement, but still
>>>>> recommendations for the 'heavy users' take between 1-5 seconds.

Manuel Blechschmidt
Dortustr. 57
14467 Potsdam
Mobil: 0173/6322621
