mahout-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Sean Owen <sro...@gmail.com>
Subject Re: Mahout performance issues
Date Fri, 02 Dec 2011 16:42:38 GMT
I'll commit #2 since that's an obvious win.

I'll make a patch with a proposal for #1, which is sort of like what you're
doing, just slightly different -- it gives a much stronger cap on the
number of interactions considered.


For #3: You can't just store counts -- you need the user IDs to compute
intersection sizes.

If I'm right about the reason #3 makes it run faster, then #2 should make
it run faster by about the same amount -- so #2 subsumes #3. And if that's
true, I'd strongly prefer to not do #3 as it would add little and make
things incorrect.

I don't think that the reason that #3 helps is because it makes fewer items
available as candidates, because it shouldn't. Say user A rates items X, Y
and Z. Say user B has rated only Y. When recommending for A, we'll find the
set of users who have rated Y (A, B and others). And when we look at B,
we'll add Y as a candidate item (the only item B rated). But Y will be
removed at the end since A already rated it. So, B didn't add any more
item-item interactions to consider. If you remove user B, you still have
the same work to do.

I think #3 "helps" because it sets the count of users per item, for some
items, to 0 instead of 1, which avoids an intersection computation. But
that's later, after you have gotten the list of candidates. It's not
reducing the number of candidates, but ignoring a lot of them later by
assuming their similarity is 0 (when it isn't!)

But that intersection computation doesn't have to be so slow. It's slow to
check whether each of 1000 items in one set are members of another set of 1
item. But checking 1 item for membership in another set of 1000 is just one
quick hash lookup.

Now, I agree that pretending that set has 0 items instead of 1 is even a
little faster still! And perhaps you could argue it's a reasonable
approximation. But my guess is that it's barely faster, at the price of
breaking correctness.


On Fri, Dec 2, 2011 at 3:58 PM, Daniel Zohar <dissoman@gmail.com> wrote:

> It's already hard to say what has the most significant impact here. Just to
> summarize, we had basically 3 changes which made an enormous improvement in
> performance :
> 1. Limit the number of 'possibleItemIDs' returned
> by SamplingCandidateStrategy
> 2. Optimizing getNumUsersWithPreferenceFor() with "larger.intersectionSize(
> smaller)"
> 3. Excluding 'singleton users' from
> GenericBooleanPrefDataModel.preferenceForItems
>
> I think we already agreed about 1 & 2 (correct me if I'm wrong).
> Regarding 3, if we had a Map that mapped itemID to numOfUsers that might
> solve the problem and still give a great performance boost. You can see in
> my previous results that this change diminishes query time to at least a
> quarter (2s vs 0.5s). What do you think?
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message