mahout-user mailing list archives

From Daniel Zohar <>
Subject Re: Mahout performance issues
Date Fri, 02 Dec 2011 11:28:43 GMT
On Fri, Dec 2, 2011 at 1:10 PM, Sean Owen <> wrote:

> On Fri, Dec 2, 2011 at 11:02 AM, Daniel Zohar <> wrote:
> > Hi guys,
> >
> > @Sean, you are obviously right that reducing the cap limit would
> > yield better performance. However, I believe it would yield worse
> > accuracy. This is because the more items a user has interacted with,
> > the smaller the percentage of capped candidate items is relative to
> > the full set of candidate items.
> >
> That's right. I'm saying there must be a middle ground that works on both
> counts, since it works fine at smaller scales, where you only have hundreds
> of interactions per recommendation computation. So, if you tune it to use
> 100, for example, I imagine you get "good" recommendations and it should be
> pretty fast, right?
> I don't see why this isn't the solution.
I'm already capping it at 100. If that turns out to be my last resort, I
will decrease it further :)

> >
> > I just ran the fix I proposed earlier and I got great results! Query
> > time for the 'heavy users' was reduced to about a third: before it was
> > 1-5 secs, now it's 0.5-1.5. The best part is that accuracy should
> > remain exactly the same. I also believe it should reduce memory
> > consumption, as GenericBooleanPrefDataModel.preferenceForItems gets
> > significantly smaller (in my case, at least).
> >
> > The fix is merely adding two lines of code to one of
> > the GenericBooleanPrefDataModel constructors. See
> >, the lines I added are #11, #22.
> >
> I don't think this works, though, because you've deleted the one data
> point you have for those users. They can't get recommendations now.
> I also can't figure out how that speeds up recommendations; what am I
> missing? These users aren't providing any more item-item interactions to
> consider.

You know this code way better than I do, so perhaps I am missing something
here. But as I see it (and I tested it as well), the user's data points
remain intact: the preferenceFromUsers set stays the same, and only
preferenceForItems is pruned. The main reason it improves performance is
the bottleneck we diagnosed before,
`GenericBooleanPrefDataModel.getNumUsersWithPreferenceFor`, which in turn
calls `FastIDSet.intersectionSize`. If we know _for sure_ that a user
interacted with only a single item, what is the point of checking every
time whether they interacted with other items? (I hope I am making myself
clear.)
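To illustrate the point: a user who appears in only one item's preference
set can never contribute to the intersection of two items' sets, so
removing them leaves the intersection size unchanged. This is a minimal
sketch of that argument using plain `HashSet<Long>` as a stand-in for
Mahout's `FastIDSet` (the class and method names here are illustrative,
not Mahout's actual internals):

```java
import java.util.HashSet;
import java.util.Set;

public class IntersectionDemo {

    // Simplified stand-in for FastIDSet.intersectionSize: counts the
    // users present in BOTH items' preference sets.
    static int intersectionSize(Set<Long> a, Set<Long> b) {
        Set<Long> copy = new HashSet<>(a);
        copy.retainAll(b);
        return copy.size();
    }

    public static void main(String[] args) {
        // Users 1 and 2 touched both items; user 3 touched only item A.
        Set<Long> usersForItemA = new HashSet<>(Set.of(1L, 2L, 3L));
        Set<Long> usersForItemB = new HashSet<>(Set.of(1L, 2L));

        int before = intersectionSize(usersForItemA, usersForItemB);

        // Prune user 3 (single interaction) from item A's set.
        usersForItemA.remove(3L);
        int after = intersectionSize(usersForItemA, usersForItemB);

        // The co-occurrence count is unchanged; the set to scan is smaller.
        System.out.println(before == after); // prints "true" (both are 2)
    }
}
```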
Because over 80% of the users in my data set had only a single
interaction, the fix yields such a large performance boost. (I believe
this case may be more common in web apps than one might think.)
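For readers without the paste link, here is a hypothetical sketch of the
shape of the proposed change (this is not Mahout's actual constructor
code, and the helper name is invented): while building the item-to-users
index, skip users with exactly one preference. The user-to-items index
(preferenceFromUsers) is left untouched, so those users can still receive
recommendations; they just no longer bloat the sets that
`intersectionSize` scans.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class PrunedItemIndexDemo {

    // Builds the item -> users index from the user -> items index,
    // skipping users who expressed only a single preference.
    static Map<Long, Set<Long>> buildPreferenceForItems(
            Map<Long, Set<Long>> preferenceFromUsers) {
        Map<Long, Set<Long>> preferenceForItems = new HashMap<>();
        for (Map.Entry<Long, Set<Long>> e : preferenceFromUsers.entrySet()) {
            long userID = e.getKey();
            Set<Long> itemIDs = e.getValue();
            if (itemIDs.size() <= 1) {
                continue; // the proposed filter: single-interaction users
            }                // never contribute to an intersection
            for (long itemID : itemIDs) {
                preferenceForItems
                    .computeIfAbsent(itemID, k -> new HashSet<>())
                    .add(userID);
            }
        }
        return preferenceForItems;
    }

    public static void main(String[] args) {
        Map<Long, Set<Long>> preferenceFromUsers = new HashMap<>();
        preferenceFromUsers.put(1L, new HashSet<>(Set.of(10L, 20L)));
        preferenceFromUsers.put(2L, new HashSet<>(Set.of(10L))); // single interaction

        Map<Long, Set<Long>> index = buildPreferenceForItems(preferenceFromUsers);
        // User 2 is absent from item 10's set; user 1 remains.
        System.out.println(index.get(10L)); // prints "[1]"
    }
}
```

With 80% of users filtered out of the index, both the memory footprint of
preferenceForItems and the cost of each intersection shrink accordingly.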
