mahout-user mailing list archives

From Daniel Zohar <disso...@gmail.com>
Subject Re: Mahout performance issues
Date Thu, 01 Dec 2011 22:35:13 GMT
Sebastian, as I wrote before, it's the other way around: ~8.5M users have
chosen only a single item. The item with the most interactions has about
400k.
This is why I'm now looking into improving GenericBooleanPrefDataModel so
that the 'preferenceForItems' map leaves out users who made only one
interaction. What do you think about this approach?
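
A minimal sketch of the idea, in plain Java rather than against Mahout's
classes: build the item-to-users index from raw (userID, itemID) pairs and
leave out every user who touched only one distinct item, since such a user
never contributes a co-occurring item pair. The class and method names
(SingleInteractionFilter, buildItemIndex) are hypothetical and are not part
of GenericBooleanPrefDataModel. One caveat: similarity measures that use
per-item user counts would see smaller counts after this filtering.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not a Mahout class.
final class SingleInteractionFilter {

    /**
     * Builds an item -> users index from {userID, itemID} pairs,
     * skipping users who interacted with fewer than two distinct items.
     */
    static Map<Long, Set<Long>> buildItemIndex(long[][] interactions) {
        // First pass: collect the distinct items of every user.
        Map<Long, Set<Long>> itemsByUser = new HashMap<>();
        for (long[] pair : interactions) {
            itemsByUser.computeIfAbsent(pair[0], k -> new HashSet<>()).add(pair[1]);
        }

        // Second pass: invert the map, ignoring single-item users.
        Map<Long, Set<Long>> usersByItem = new HashMap<>();
        for (Map.Entry<Long, Set<Long>> entry : itemsByUser.entrySet()) {
            if (entry.getValue().size() < 2) {
                continue; // no co-occurrence can come from this user
            }
            for (Long item : entry.getValue()) {
                usersByItem.computeIfAbsent(item, k -> new HashSet<>()).add(entry.getKey());
            }
        }
        return usersByItem;
    }

    public static void main(String[] args) {
        // Users 1 and 4 each have a single interaction and are left out.
        long[][] data = { {1, 10}, {2, 10}, {2, 20}, {3, 20}, {3, 30}, {4, 10} };
        System.out.println(buildItemIndex(data));
    }
}
```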

On Thu, Dec 1, 2011 at 8:10 PM, Sebastian Schelter <ssc@apache.org> wrote:

> If I remember correctly, you have 12M users and 18M interactions.
>
> If I interpret the plots correctly there is one single item that
> accounts for 8.5M interactions (nearly half of the overall interactions)
> and more than two thirds of the users like it?
>
> If that is true, this item will co-occur with virtually every other
> item in the dataset, ruining the runtime as you will have to estimate
> the preference for every item each time you compute recommendations.
>
> Normally the sampling done by SamplingCandidateItemsStrategy should hit
> such 'top-sellers' harder than the rest and therefore mitigate the
> impact of them on the runtime, but I guess your dataset has so few
> per-user interactions overall that the sampling doesn't really help here.
>
> This top item is also of no real value as everybody seems to already
> know it and was able to find it. You can't really learn a lot from an
> item that everybody likes.
>
> Can you check my findings and try to simply throw the item away?
>
> --sebastian
>
>
>
> On 01.12.2011 16:16, Sebastian Schelter wrote:
>
> >
> > --sebastian
> >
> > On 01.12.2011 16:12, Sean Owen wrote:
> >> You can 'tickle' the cache asynchronously if you like.
> >>
> >> I am still not clear on why you are doing so many item-item similarity
> >> calculations. Your change ought to let you do 1, or 10, or 100 per
> >> calculation if you like. That, we know, is fast. And a few hundred
> >> similarities should start to give reasonable recommendations.
> >>
> >> What is preventing you from making this tradeoff (with your change)?
> >> Yes, it is essential for reasonable performance.
> >>
> >> On Thu, Dec 1, 2011 at 3:06 PM, Daniel Zohar <dissoman@gmail.com> wrote:
> >>
> >>> Hi Manuel,
> >>> I haven't got to the point where CachingItemSimilarity kicks in. That
> >>> is, I will have to run a lot of recommendations in order to get a real
> >>> benefit from it. I would first like to optimize the 'cold start' so
> >>> that it at least serves in a reasonable time. Usually a cache is used
> >>> to prevent repeated calculations, but personally I don't think it's a
> >>> replacement for optimized performance. Don't you agree?
> >>>
> >>> Also, I will try to profile the app now as you suggest and send the
> >>> results asap.
> >>>
> >>> Thanks!
> >>>
> >>> On Thu, Dec 1, 2011 at 4:56 PM, Manuel Blechschmidt <
> >>> Manuel.Blechschmidt@gmx.de> wrote:
> >>>
> >>>> Hi Daniel,
> >>>> actually you are profiling inside Tomcat. You should take a
> >>>> snapshot and then drill down to the functions where the actual
> >>>> recommendation takes place. The current screenshots also contain some
> >>>> profiles from Tomcat threads which are sleeping a lot and therefore
> >>>> appear to take a lot of time.
> >>>>
> >>>> Further, the screenshots do not show how often the different
> >>>> functions are called.
> >>>>
> >>>> You have to profile multiple requests on their own. The
> >>>> CachingItemSimilarity gets filled, so requests should get faster and
> >>>> faster.
> >>>>
> >>>> On 01.12.2011, at 15:11, Daniel Zohar wrote:
> >>>>
> >>>>> @Manuel thanks for the tips. I have installed VisualVM and the
> >>>>> following are the results.
> >>>>> I did two samplings:
> >>>>> - With the optimized SamplingCandidateItemsStrategy
> >>>>>   (http://pastebin.com/6n9C8Pw1):
> >>>>>   http://static.inky.ws/image/934/image.jpg
> >>>>> - Without the optimized SamplingCandidateItemsStrategy:
> >>>>>   http://static.inky.ws/image/935/image.jpg
> >>>>>
> >>>>
> >>>> The big hot spot is the function FastIDSet.find():
> >>>>
> >>>> Optimized: 13,759 s
> >>>> Unoptimized: 246,487 s
> >>>>
> >>>> So you see that your optimization already made this hot spot roughly
> >>>> 18 times faster (246,487 / 13,759 is about 17.9).
> >>>>
> >>>> Did you play around with the CachingItemSimilarity cache sizes?
> >>>>
> >>>> /Manuel
> >>>>
> >>>> --
> >>>> Manuel Blechschmidt
> >>>> Dortustr. 57
> >>>> 14467 Potsdam
> >>>> Mobil: 0173/6322621
> >>>> Twitter: http://twitter.com/Manuel_B
> >>>>
> >>>>
> >>>
> >>
> >
>
>
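
Sean's point in the quoted thread, that only a bounded number of item-item
similarities needs to be computed per recommendation, can be sketched as a
cap on the candidate set. This is a minimal illustration using uniform
reservoir sampling in plain Java; it is not Mahout's
SamplingCandidateItemsStrategy and not the patch from the pastebin link,
and the names CandidateCap and sample are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical helper, not a Mahout class.
final class CandidateCap {

    /**
     * Returns at most maxCandidates items drawn uniformly at random from
     * the candidates, in a single pass (reservoir sampling, Algorithm R).
     */
    static List<Long> sample(Iterable<Long> candidates, int maxCandidates, Random rng) {
        List<Long> reservoir = new ArrayList<>(maxCandidates);
        int seen = 0;
        for (Long item : candidates) {
            seen++;
            if (reservoir.size() < maxCandidates) {
                // Fill the reservoir with the first maxCandidates items.
                reservoir.add(item);
            } else {
                // Replace a random slot with probability maxCandidates / seen.
                int slot = rng.nextInt(seen);
                if (slot < maxCandidates) {
                    reservoir.set(slot, item);
                }
            }
        }
        return reservoir;
    }

    public static void main(String[] args) {
        List<Long> all = new ArrayList<>();
        for (long id = 0; id < 1000; id++) {
            all.add(id);
        }
        // Only 100 candidates would be scored, however many co-occur.
        List<Long> capped = sample(all, 100, new Random(42));
        System.out.println(capped.size());
    }
}
```

The cap trades recommendation quality for a bounded cost per request, which
is the tradeoff Sean describes.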
