mahout-user mailing list archives

From Daniel Zohar <disso...@gmail.com>
Subject Re: Mahout performance issues
Date Fri, 02 Dec 2011 18:07:59 GMT
On Fri, Dec 2, 2011 at 6:42 PM, Sean Owen <srowen@gmail.com> wrote:

> I'll commit #2 since that's an obvious win.
>
>
Great.


> I'll make a patch with a proposal for #1, which is sort of like what you're
> doing, just slightly different -- it gives a much stronger cap on the
> number of interactions considered.
>
Alright.


>
> For #3: You can't just store counts -- you need the user IDs to compute
> intersection sizes.
>
I'm suggesting to have an additional Map<Long, Long>
under GenericBooleanPrefDataModel. There would still
be GenericBooleanPrefDataModel.preferenceFromUsers
and GenericBooleanPrefDataModel.preferenceForItems, only the latter
might be a lot smaller in cases like mine.
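To make the idea concrete, here is a minimal sketch of what I mean: alongside a
preferenceForItems index that could exclude singleton users, keep an itemID ->
user-count map built from the full preferenceFromUsers index, so
getNumUsersWithPreferenceFor stays correct. Field and method names here are
illustrative only, not the actual Mahout API:

```java
import java.util.HashMap;
import java.util.Map;

public class ItemUserCounts {

  // Count users per item over ALL users, including singleton users.
  // This map would be kept even if preferenceForItems drops singletons.
  static Map<Long, Long> countUsersPerItem(Map<Long, long[]> preferenceFromUsers) {
    Map<Long, Long> numUsersPerItem = new HashMap<>();
    for (long[] items : preferenceFromUsers.values()) {
      for (long item : items) {
        numUsersPerItem.merge(item, 1L, Long::sum);
      }
    }
    return numUsersPerItem;
  }

  public static void main(String[] args) {
    // userID -> itemIDs the user interacted with (boolean prefs)
    Map<Long, long[]> preferenceFromUsers = new HashMap<>();
    preferenceFromUsers.put(1L, new long[] {10L, 11L});
    preferenceFromUsers.put(2L, new long[] {10L});      // singleton user
    preferenceFromUsers.put(3L, new long[] {11L, 12L});

    // The count for item 10 is 2, even if user 2 is later excluded
    // from preferenceForItems.
    System.out.println(countUsersPerItem(preferenceFromUsers).get(10L));
  }
}
```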


> If I'm right about the reason #3 makes it run faster, then #2 should make
> it run faster by about the same amount -- so #2 subsumes #3. And if that's
> true, I'd strongly prefer to not do #3 as it would add little and make
> things incorrect.
>
> I don't think that the reason that #3 helps is because it makes fewer items
> available as candidates, because it shouldn't. Say user A rates items X, Y
> and Z. Say user B has rated only Y. When recommending for A, we'll find the
> set of users who have rated Y (A, B and others). And when we look at B,
> we'll add Y as a candidate item (the only item B rated). But Y will be
> removed at the end since A already rated it. So, B didn't add any more
> item-item interactions to consider. If you remove user B, you still have
> the same work to do.
>
> I think #3 "helps" because it sets the count of users per item, for some
> items, to 0 instead of 1, which avoids an intersection computation. But
> that's later, after you have gotten the list of candidates. It's not
> reducing the number of candidates, but ignoring a lot of them later by
> assuming their similarity is 0 (when it isn't!)
>
> But that intersection computation doesn't have to be so slow. It's slow to
> check whether each of 1000 items in one set are members of another set of 1
> item. But checking 1 item for membership in another set of 1000 is just one
> quick hash lookup.
>
> Now, I agree that pretending that set has 0 items instead of 1 is even a
> little faster still! And perhaps you could argue it's a reasonable
> approximation. But my guess is that it's barely faster, at the price of
> breaking correctness.
>
I definitely agree that correctness should not be broken. My solution
is not meant to decrease the number of possible items, as you stated in
your example. It is meant to reduce the number of item-user associations
(while preserving user-item associations), which results in much less
effort in intersectionSize(). Even in the case where we have two popular
items, item A with 50k interactions and item B with 100k interactions,
there are still 50k lookups inside intersectionSize(). If 50% of the users
who interacted with that item interacted _only_ with that item, we are talking
about a huge saving, are we not?
For example, how many of the people who bought The Da Vinci Code on Amazon
had bought other books there as well?
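For reference, the point you made about #2 can be sketched like this: always
iterate over the smaller user set and probe the larger one, so intersecting a
100k-user item with a 1-user item costs one hash lookup rather than 100k. This
is an illustrative standalone version, not the actual FastIDSet code:

```java
import java.util.HashSet;
import java.util.Set;

public class IntersectionSketch {

  // Compute |larger ∩ smaller| by probing the larger set with each
  // element of the smaller set; swap first if the sizes are reversed.
  static int intersectionSize(Set<Long> larger, Set<Long> smaller) {
    if (smaller.size() > larger.size()) {
      Set<Long> tmp = larger;
      larger = smaller;
      smaller = tmp;
    }
    int count = 0;
    for (long id : smaller) {
      if (larger.contains(id)) {
        count++;
      }
    }
    return count;
  }

  public static void main(String[] args) {
    // A popular item with 1000 users vs. a rare item with 2 users:
    // only 2 hash lookups are needed, not 1000.
    Set<Long> popularItemUsers = new HashSet<>();
    for (long u = 0; u < 1000; u++) {
      popularItemUsers.add(u);
    }
    Set<Long> rareItemUsers = Set.of(42L, 99999L);
    System.out.println(intersectionSize(popularItemUsers, rareItemUsers)); // prints 1
  }
}
```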


> On Fri, Dec 2, 2011 at 3:58 PM, Daniel Zohar <dissoman@gmail.com> wrote:
>
> > It's already hard to say what has the most significant impact here. Just
> to
> > summarize, we had basically 3 changes which made an enormous improvement
> in
> > performance :
> > 1. Limit the number of 'possibleItemIDs' returned
> > by SamplingCandidateStrategy
> > 2. Optimizing getNumUsersWithPreferenceFor() with
> "larger.intersectionSize(
> > smaller)"
> > 3. Excluding 'singleton users' from
> > GenericBooleanPrefDataModel.preferenceForItems
> >
> > I think we already agreed about 1 & 2 (correct me if I'm wrong).
> > Regarding 3, if we had a Map that mapped itemID to numOfUsers that might
> > solve the problem and still give a great performance boost. You can see
> in
> > my previous results that this change diminishes query time to at least a
> > quarter (2s vs 0.5s). What do you think?
> >
> >
>
