mahout-user mailing list archives

From: Sean Owen
Subject: Re: Mahout performance issues
Date: Fri, 02 Dec 2011 11:53:15 GMT

On Fri, Dec 2, 2011 at 11:28 AM, Daniel Zohar wrote:

> I'm already capping it at 100. If this will be my last resort, I will
> decrease it more :)

This just can't be... 100 item-item similarities take milliseconds to
compute. Something else is going on.
I should make a JIRA to propose my own version of this filtering just to
make sure we're talking about the same thing.

> You know this code way better than I do, so perhaps I am missing something
> here. But as I see it (and I tested it as well) the users data point
> remains intact. That's because the preferenceFromUsers Set remains the same
> while only preferenceForItems is optimized. The main reason it improves
> performance is because of the bottleneck we diagnosed before -
> `GenericBooleanPrefDataModel.getNumUsersWithPreferenceFor` which in turn
> calls `FastIDSet.intersectionSize`. Now, if we know _for sure_ that a user
> interacted with a single item only, what's the point of checking every time
> if it had interacted with other items? (I hope I make myself clear)
> Because over 80% of the users in my data set had only a single
> interaction, it gives such a performance boost. (I believe this case might
> be more common than one might think in web apps)

Let me propose a better way to address that bottleneck. I think the problem
is that the intersection computation is dumb, and should really compute

Try ending getNumUsersWithPreferenceFor() with:

    return userIDs1.size() < userIDs2.size() ?
        userIDs2.intersectionSize(userIDs1) :
        userIDs1.intersectionSize(userIDs2);
It won't produce the same speedup, but it's more correct than omitting this
data just to get this effect. If it gets 80% of the speedup, that's a great
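
To illustrate why the order of the two sets matters, here is a minimal, self-contained sketch. It is not Mahout's code: java.util.HashSet stands in for FastIDSet, and the class name IntersectionSketch and the variable names are hypothetical. The assumption is that an intersection count walks one set and probes the other, so walking the smaller set keeps the work proportional to the smaller size.

```java
import java.util.HashSet;
import java.util.Set;

public class IntersectionSketch {

    // Counts the IDs present in both sets by iterating over the smaller set
    // and probing the larger one, so the cost is about min(|a|, |b|) lookups.
    public static int intersectionSize(Set<Long> a, Set<Long> b) {
        Set<Long> smaller = a.size() < b.size() ? a : b;
        Set<Long> larger = (smaller == a) ? b : a;
        int count = 0;
        for (Long id : smaller) {
            if (larger.contains(id)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Hypothetical data: a popular item seen by 100,000 users ...
        Set<Long> usersOfPopularItem = new HashSet<>();
        for (long id = 0; id < 100_000; id++) {
            usersOfPopularItem.add(id);
        }
        // ... and a rare item seen by three users, one of them unknown
        // to the popular item.
        Set<Long> usersOfRareItem = new HashSet<>();
        usersOfRareItem.add(5L);
        usersOfRareItem.add(42L);
        usersOfRareItem.add(1_000_000L);

        // Only three lookups are needed, regardless of argument order.
        System.out.println(intersectionSize(usersOfPopularItem, usersOfRareItem)); // prints 2
    }
}
```

With many single-interaction users, one of the two sets is often tiny, which is the case where picking the smaller set to iterate should help the most.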
