mahout-user mailing list archives

From: Ted Dunning <ted.dunn...@gmail.com>
Subject: Re: Mahout performance issues
Date: Fri, 02 Dec 2011 18:00:43 GMT
Does #1 mean down-sampling the items for each user?  Or does it only
down-sample the number of items for the user we are producing
recommendations for?

I recommend down-sampling for all users.  If you down-sample with a
bias toward keeping low-frequency items, this will also kill the
problem of high-frequency items, and you get all the performance gains
you are talking about, and more, without significant error.
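
Something along these lines, as an untested sketch (itemFrequency() is
a stand-in for however you get the per-item user count, not a real
method):

  import java.util.Arrays;
  import java.util.Random;

  // Untested sketch: keep roughly maxItems of a user's items,
  // preferring to drop the high-frequency ones. Each item is kept
  // with probability proportional to 1/frequency.
  long[] downSample(long[] itemIDs, int maxItems, Random random) {
    if (itemIDs.length <= maxItems) {
      return itemIDs;
    }
    double[] weights = new double[itemIDs.length];
    double totalWeight = 0.0;
    for (int i = 0; i < itemIDs.length; i++) {
      // itemFrequency(id): assumed helper returning how many users
      // have a preference for id
      weights[i] = 1.0 / itemFrequency(itemIDs[i]);
      totalWeight += weights[i];
    }
    long[] kept = new long[maxItems];
    int count = 0;
    for (int i = 0; i < itemIDs.length && count < maxItems; i++) {
      if (random.nextDouble() < maxItems * weights[i] / totalWeight) {
        kept[count++] = itemIDs[i];
      }
    }
    return Arrays.copyOf(kept, count);
  }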

On Fri, Dec 2, 2011 at 7:58 AM, Daniel Zohar <dissoman@gmail.com> wrote:

> It's already hard to say which change has the most significant impact
> here. Just to summarize, we had basically 3 changes which made an
> enormous improvement in performance:
> 1. Limiting the number of 'possibleItemIDs' returned
> by SamplingCandidateItemsStrategy
> 2. Optimizing getNumUsersWithPreferenceFor() with
> "larger.intersectionSize(smaller)" (sketched below)
> 3. Excluding 'singleton users' from
> GenericBooleanPrefDataModel.preferenceForItems
>
> I think we already agreed about 1 & 2 (correct me if I'm wrong).
> Regarding 3, a Map from itemID to numOfUsers might solve the problem
> and still give a great performance boost. You can see in my previous
> results that this change cuts query time to a quarter or less
> (2s vs. 0.5s). What do you think?
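>
> For reference, change 2 amounts to roughly this (a sketch, not the
> actual patch; preferenceForItems maps itemID -> FastIDSet of users):
>
>   FastIDSet users1 = preferenceForItems.get(itemID1);
>   FastIDSet users2 = preferenceForItems.get(itemID2);
>   if (users1 == null || users2 == null) {
>     return 0;
>   }
>   // Call intersectionSize() on the larger set, with the smaller one
>   // as the argument, so the cheap direction is used
>   return users1.size() > users2.size()
>       ? users1.intersectionSize(users2)
>       : users2.intersectionSize(users1);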
>
> On Fri, Dec 2, 2011 at 5:38 PM, Sean Owen <srowen@gmail.com> wrote:
>
> > On Fri, Dec 2, 2011 at 2:33 PM, Daniel Zohar <dissoman@gmail.com> wrote:
> >
> > > On Fri, Dec 2, 2011 at 1:53 PM, Sean Owen <srowen@gmail.com> wrote:
> > >
> > > > On Fri, Dec 2, 2011 at 11:28 AM, Daniel Zohar <dissoman@gmail.com> wrote:
> > > >
> > > > > I'm already capping it at 100. If this is my last resort, I
> > > > > will decrease it more :)
> > > > >
> > > >
> > > > This just can't be... 100 item-item similarities take
> > > > milliseconds to compute. Something else is going on.
> > > > I should make a JIRA to propose my own version of this filtering
> > > > just to make sure we're talking about the same thing.
> > > >
> > > I limit only the possibleItemIDs in my altered version
> > > of SamplingCandidateItemsStrategy. Since TopItems.getTopItems()
> > > computes similarities for every previously interacted item against
> > > the set of 'possibleItemIDs', you are correct only when the user
> > > has a single interaction. However, if the user has made 20
> > > interactions, we're talking about 2000 item-item similarities.
> > >
> >
> > Sure, but why not limit those 2000 interactions to 100?
> >
> > I'm not sure whether 100 is the optimal number, but clearly, at
> > smaller scale, this all runs quite fast and produces reasonable
> > results. Say that, at some smaller scale, it considers on average X
> > interactions. If you turn this down so that only X interactions are
> > considered here, I don't see why you wouldn't get similar
> > performance and quality.
> >
> > I was actually considering a patch that would simply apply the max
> > both to the number of users sampled per item and to the number of
> > items taken from each of those users. If you clamped it at 10, then
> > each item you've rated would produce at most 10*10 interactions. We
> > only limit one of those things now.
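> >
> > In rough Java terms (itemsRatedBy, sampleUsers and sampleItems are
> > hypothetical helpers, not real Mahout API; this is just the shape of
> > the idea):
> >
> >   int max = 10;
> >   FastIDSet candidates = new FastIDSet();
> >   for (long itemID : itemsRatedBy(userID)) {           // every rated item
> >     for (long otherUser : sampleUsers(itemID, max)) {  // at most 10 users
> >       for (long item : sampleItems(otherUser, max)) {  // at most 10 items
> >         candidates.add(item);
> >       }
> >     }
> >   }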
> >
> > Well... maybe with the change below it speeds up further so that you can
> > get away with more data in less time, so that it is much easier to find a
> > good tradeoff.
> >
> > > You nailed it! It improves performance dramatically. Without my
> > > last fix, we're talking about a max of 2s; with my fix, it didn't
> > > go over 0.5s!
> > >
> > ... but does this change alone produce a good speedup?
> >
> > > I still don't see any problem with not including 'singleton
> > > users' inside preferenceForItems as long as preferenceFromUsers
> > > stays intact. Can you please elaborate more on the problem as you
> > > see it? I feel we're having some kind of communication problem :P
> > >
> >
> > The calculations are just wrong then, since you don't have the
> > right user counts per item. Methods that answer that question give
> > the wrong answer; similarity metrics like log-likelihood give
> > slightly wrong results too. At this point I don't think it's good to
> > sacrifice correctness for an optimization when there seems to be (?)
> > a way to have most of the speedup without any correctness problem.
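> >
> > For example (hypothetical numbers): if an item was preferred by 5
> > users, 3 of whom are singletons, getNumUsersWithPreferenceFor()
> > would report 2 instead of 5, and a log-likelihood score built on
> > that count shifts accordingly.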
> >
>
