mahout-user mailing list archives

From Sean Owen <>
Subject Re: Mahout performance issues
Date Fri, 02 Dec 2011 15:38:52 GMT
On Fri, Dec 2, 2011 at 2:33 PM, Daniel Zohar <> wrote:

> On Fri, Dec 2, 2011 at 1:53 PM, Sean Owen <> wrote:
> > On Fri, Dec 2, 2011 at 11:28 AM, Daniel Zohar <>
> wrote:
> >
> > > I'm already capping it at 100. If this will be my last resort, I will
> > > decrease it more :)
> > >
> >
> > This just can't be... 100 item-item similarities takes milliseconds to
> > compute. Something else is going on.
> > I should make a JIRA to propose my own version of this filtering just to
> > make sure we're talking about the same thing.
> >
> >
> >
> I limit only the possibleItemIDs in my altered version
> of SamplingCandidateItemsStrategy.
> Since TopItems.getTopItems() computes similarities for every previously
> interacted item against the set of 'possibleItemIDs', you are correct only
> when the user has a single interaction. However, if the user has made 20
> interactions, we're talking about 2000 item-item similarities.

Sure, but why not limit those 2000 interactions to 100?

I'm not sure whether "100" is the optimal number, but clearly, at smaller
scale, this all runs quite fast and produces reasonable results. Say that,
at some smaller scale, it considers on average X interactions. If you turn
this down so that only X interactions are considered here, I don't see why
you wouldn't get similar performance and quality.
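
To put that arithmetic in concrete terms, this is roughly the shape of the
work being discussed: for each candidate item, a similarity is computed
against every item the user has already interacted with. (A standalone
sketch with made-up names, not the actual TopItems.getTopItems /
SamplingCandidateItemsStrategy code.)

// Hypothetical sketch, not actual Mahout code.
import java.util.List;

final class CandidateScoringSketch {

  // Stand-in for an item-item similarity metric.
  interface ItemSimilarity {
    double similarity(long itemA, long itemB);
  }

  // With ~100 candidates and ~20 interacted items, the inner loop runs
  // ~2000 times -- one similarity per (candidate, interacted) pair.
  static double[] scoreCandidates(List<Long> possibleItemIDs,
                                  List<Long> interactedItemIDs,
                                  ItemSimilarity similarity) {
    double[] scores = new double[possibleItemIDs.size()];
    for (int i = 0; i < possibleItemIDs.size(); i++) {
      double total = 0.0;
      for (long interacted : interactedItemIDs) {
        total += similarity.similarity(possibleItemIDs.get(i), interacted);
      }
      scores[i] = total;
    }
    return scores;
  }
}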

I was actually considering a patch that would simply apply the max both to
the number of users sampled per item and to the number of items taken from
each of those users. If you clamped it at 10, then each item you've rated
would produce at most 10*10 = 100 candidate items. We only limit one of
those things now.
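
Roughly, such a double cap could look like this (a hypothetical sketch with
made-up names, not the existing SamplingCandidateItemsStrategy; each rated
item then contributes at most maxUsersPerItem * maxItemsPerUser candidates):

// Hypothetical sketch, not actual Mahout code.
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class DoubleCapSamplingSketch {

  // Stand-in for the data model's preference lookups.
  interface Prefs {
    List<Long> usersWithPreferenceFor(long itemID);
    List<Long> itemsPreferredBy(long userID);
  }

  // Cap both the users sampled per rated item and the items taken from each
  // sampled user, so a cap of 10 yields at most 10 * 10 = 100 candidates
  // per rated item.
  static Set<Long> candidateItems(List<Long> ratedItemIDs,
                                  Prefs prefs,
                                  int maxUsersPerItem,
                                  int maxItemsPerUser) {
    Set<Long> candidates = new LinkedHashSet<Long>();
    for (long itemID : ratedItemIDs) {
      List<Long> users = new ArrayList<Long>(prefs.usersWithPreferenceFor(itemID));
      Collections.shuffle(users); // crude random sampling of users
      for (long userID : users.subList(0, Math.min(maxUsersPerItem, users.size()))) {
        List<Long> items = new ArrayList<Long>(prefs.itemsPreferredBy(userID));
        Collections.shuffle(items); // crude random sampling of that user's items
        candidates.addAll(items.subList(0, Math.min(maxItemsPerUser, items.size())));
      }
    }
    candidates.removeAll(ratedItemIDs); // don't suggest items already rated
    return candidates;
  }
}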

Well... maybe with the change below it speeds up further, so you can get
away with more data in less time and it becomes much easier to find a good
tradeoff.

> You nailed it! It improves the performance tremendously. Without my last
> fix, we're talking about max 2s; with my fix, it didn't go over 0.5s!

... but does this change alone produce a good speedup?

> I still don't see any problem with not including 'singleton users' inside
> preferenceForItems as long as preferenceFromUsers stays intact. Can you
> please elaborate more on the problem as you see it? I feel we're having
> some kind of communication problem :P

The calculations are just wrong then, since you don't have the right user
counts per item. Methods that report how many users have a preference for
an item give the wrong answer, and similarity metrics like log-likelihood
give slightly wrong results too. At this point I don't think it's good to
sacrifice correctness for an optimization when there seems to be (?) a way
to get most of the speedup without any correctness problem.
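
To make that concrete: log-likelihood similarity is computed from per-item
user counts, so leaving singleton users out of preferenceForItems changes
those counts and nudges every score that depends on them. A standalone
illustration of the standard 2x2 log-likelihood ratio (made-up counts, not
the actual Mahout LogLikelihoodSimilarity code):

// Hypothetical sketch, not actual Mahout code.
final class LlrCountsSketch {

  static double xLogX(long x) {
    return x == 0L ? 0.0 : x * Math.log(x);
  }

  // Entropy over raw counts, as used in the G^2 / LLR statistic.
  static double entropy(long... counts) {
    long sum = 0L;
    double xLogXSum = 0.0;
    for (long c : counts) {
      sum += c;
      xLogXSum += xLogX(c);
    }
    return xLogX(sum) - xLogXSum;
  }

  // k11: users preferring both items, k12: item A only,
  // k21: item B only, k22: neither.
  static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
    double rowEntropy = entropy(k11 + k12, k21 + k22);
    double columnEntropy = entropy(k11 + k21, k12 + k22);
    double matrixEntropy = entropy(k11, k12, k21, k22);
    return 2.0 * (rowEntropy + columnEntropy - matrixEntropy);
  }

  public static void main(String[] args) {
    // A singleton user who preferred only item B falls in k21. Counting that
    // user versus dropping it from the per-item counts shifts the score.
    System.out.println(logLikelihoodRatio(10, 40, 30, 920)); // singleton counted
    System.out.println(logLikelihoodRatio(10, 40, 29, 920)); // singleton dropped
  }
}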
