mahout-user mailing list archives

From Daniel Zohar <disso...@gmail.com>
Subject Re: Mahout performance issues
Date Thu, 01 Dec 2011 14:11:12 GMT
@Sean, yes I am using CachingItemSimilarity, and I can see that over time
performance is better.
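
The warm-up effect can be pictured with a small plain-Java sketch. This is not Mahout's CachingItemSimilarity; the class name CachedSimilaritySketch, the string cache key, and the LRU eviction policy are all assumptions made for illustration. It only shows why repeated lookups of the same item pair get cheaper once the cache fills.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.ToDoubleBiFunction;

// Hypothetical illustration only; NOT Mahout's CachingItemSimilarity.
final class CachedSimilaritySketch {
    private final ToDoubleBiFunction<Long, Long> delegate;
    private final Map<String, Double> cache;
    private int delegateCalls = 0;

    CachedSimilaritySketch(ToDoubleBiFunction<Long, Long> delegate, final int maxEntries) {
        this.delegate = delegate;
        // An access-ordered LinkedHashMap evicts the least recently used pair
        // once the cache grows beyond maxEntries (assumed eviction policy).
        this.cache = new LinkedHashMap<String, Double>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Double> eldest) {
                return size() > maxEntries;
            }
        };
    }

    // Not thread-safe; a real server would need synchronization.
    double similarity(long a, long b) {
        // Similarity is assumed symmetric, so the pair is ordered before keying.
        String key = Math.min(a, b) + ":" + Math.max(a, b);
        Double cached = cache.get(key);
        if (cached != null) {
            return cached;
        }
        delegateCalls++;
        double value = delegate.applyAsDouble(a, b);
        cache.put(key, value);
        return value;
    }

    // Number of times the underlying (expensive) similarity was computed.
    int delegateCalls() {
        return delegateCalls;
    }
}
```

With this sketch, calling similarity(1, 3) and then similarity(3, 1) computes the underlying value only once, which is the kind of saving that accumulates as the cache warms up.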
@Manuel, thanks for the tips. I have installed VisualVM, and the results
follow.
I took two samples:
- With the optimized SamplingCandidateItemsStrategy (
http://pastebin.com/6n9C8Pw1): http://static.inky.ws/image/934/image.jpg
- Without the optimized SamplingCandidateItemsStrategy:
http://static.inky.ws/image/935/image.jpg

@Sebastian, here are the two curves you asked for.
Item-users: http://static.inky.ws/image/932/image.jpg
User-items: http://static.inky.ws/image/932/image.jpg
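
For reference, curves like these can be derived from a simple count. The following is a minimal plain-Java sketch, assuming the interactions are already available as a map from user ID to the set of item IDs (or from item ID to user IDs); the class and method names are hypothetical and it does not use the Mahout DataModel API.

```java
import java.util.Collection;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical helper; not part of Mahout.
final class InteractionHistogramSketch {
    // Maps "number of interactions" to "how many users (or items) have that many".
    // Pass userToItems.values() for the per-user curve and
    // itemToUsers.values() for the per-item curve.
    static TreeMap<Integer, Integer> histogram(Collection<? extends Set<Long>> interactionSets) {
        TreeMap<Integer, Integer> counts = new TreeMap<>();
        for (Set<Long> interactions : interactionSets) {
            counts.merge(interactions.size(), 1, Integer::sum);
        }
        return counts;
    }
}
```

The TreeMap keeps the interaction counts sorted, so its entries can be plotted directly as a distribution curve.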

I think the above curves clearly show that a lot of my data does not need
to be checked when looking for similar items. That's because if a user
made only a single choice in the past, there's no point in checking his
other choices at all when computing item similarities.

I think this is something that should be integrated into the DataModel.
Maybe there should be one Set that holds only the users who have made more
than one choice. This would greatly improve performance in my case. What do
you think?
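
To make the proposal concrete, here is a rough plain-Java sketch of the idea. It is not Mahout code: the class MultiChoiceUsersSketch, both method names, and the map-based data layout are assumptions for illustration only. A user with a single choice can only contribute the item being queried, so skipping such users leaves the candidate set unchanged while avoiding the lookups.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration; not part of Mahout's DataModel.
final class MultiChoiceUsersSketch {
    // Users with at least two items; only they can link one item to another.
    static Set<Long> usersWithMultipleChoices(Map<Long, Set<Long>> userToItems) {
        Set<Long> result = new HashSet<>();
        for (Map.Entry<Long, Set<Long>> entry : userToItems.entrySet()) {
            if (entry.getValue().size() > 1) {
                result.add(entry.getKey());
            }
        }
        return result;
    }

    // Items co-chosen with itemId, visiting only the multi-choice users.
    static Set<Long> candidateItems(long itemId,
                                    Map<Long, Set<Long>> itemToUsers,
                                    Map<Long, Set<Long>> userToItems,
                                    Set<Long> multiChoiceUsers) {
        Set<Long> candidates = new HashSet<>();
        Set<Long> users = itemToUsers.get(itemId);
        if (users == null) {
            return candidates; // unknown item: no candidates
        }
        for (Long user : users) {
            if (multiChoiceUsers.contains(user)) {
                candidates.addAll(userToItems.get(user));
            }
        }
        candidates.remove(itemId); // the item is not a candidate for itself
        return candidates;
    }
}
```

The set of multi-choice users would be built once when the data is loaded and then reused for every candidate lookup.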

On Thu, Dec 1, 2011 at 10:52 AM, Manuel Blechschmidt <
Manuel.Blechschmidt@gmx.de> wrote:

> Hello,
>
> On 01.12.2011, at 09:37, Sebastian Schelter wrote:
>
> > Daniel, can you plot two curves showing the distribution of
> > interactions per user and the distribution of interactions per item? I
> > think we need to get a better picture of your data first.
> >
> > Generally I always recommend using precomputed similarities. You can
> > still serve new users with realtime recommendations; the only
> > disadvantages are the higher complexity and a delayed inclusion of new
> > items.
>
> In this paper:
> Fast Online Learning through Offline Initialization for Time-sensitive
> Recommendation
> http://users.cs.fiu.edu/~lzhen001/activities/KDD_USB_key_2010/docs/p703.pdf
> Deepak Agarwal et al. describe a way to include new items quickly in
> the recommendations.
> This is used for personalizing the news stories on the Yahoo start page.
>
> @Daniel: I would also recommend profiling your application with JVisualVM:
> http://visualvm.java.net/
>
> After I did this with my recommender, I figured out that the default cache
> size for item similarities was far too low. The details are described in
> this ticket:
> https://issues.apache.org/jira/browse/MAHOUT-905
>
>
> >
> > --sebastian
>
> /Manuel
>
> >
> > 2011/11/30 Sean Owen <srowen@gmail.com>:
> >> The simple answer is that:
> >>
> >> Mahout absorbed a non-distributed recommender project called Taste,
> >> which scales up to a point which may be sufficient for a lot of users.
> >> It certainly is a lot simpler. Yes it is realistic to do near-real-time
> >> recommendations, though it gets harder and harder and requires more
> >> tuning, tradeoffs and optimization as this thread shows.
> >>
> >> The rest, written from scratch, is almost all distributed and
> >> Hadoop-based, including distributed re-implementations of the same
> >> algorithms.
> >>
> >> On Wed, Nov 30, 2011 at 8:23 PM, Dan Beaulieu
> >> <danjacob.beaulieu@gmail.com> wrote:
> >>
> >>> Hi all, this is a tangent and can mostly be ignored by the people
> >>> interested in this problem.
> >>>
> >>> I'm new to Machine Learning and especially Mahout. Following this
> >>> discussion has made me a bit confused.
> >>> Isn't Mahout used for large datasets where it makes sense to
> >>> distribute the work? Why then isn't anyone pointing
> >>> out that the problem may be the use of one single Mahout node? Is it
> >>> because it's boolean based? Is it because the data set
> >>> isn't really that large?
> >>>
> >>> Even if for whatever reason a single node will do for this case, is it
> >>> really expected that the recommendation process would finish in less
> >>> than half a second?
> >>> This makes me think if that is the expectation then the data set is
> >>> actually small and Mahout might be overkill...
> >>>
> >>> What obvious piece of the Mahout puzzle am I missing?
> >>>
> >>> Thanks.
> >>>
> >>> Dan
> >>>
> >>> On Wed, Nov 30, 2011 at 11:56 AM, Sean Owen <srowen@gmail.com> wrote:
> >>>
> >>>> Have you used CachingItemSimilarity? That will hold common
> >>>> similarities in memory. It's a lot easier than pre-computing and
> >>>> might help.
> >>>>
> >>>> I think something like your change is a good one (Sebastian what do
> >>>> you think) in that it gives you the ultimate lever to control how many
> >>>> candidates are evaluated. That ought to make it go as fast as you
> >>>> like, but it trades off quality. Still I'd be really surprised if
> >>>> there's no viable middle ground -- this works fine at smaller scale,
> >>>> where 100s of candidates are evaluated, perhaps, and you can use your
> >>>> lever to get to 100s of candidates at your scale too. Is that still
> >>>> both slow and inaccurate?
> >>>>
> >>>> On Wed, Nov 30, 2011 at 3:18 PM, Daniel Zohar <dissoman@gmail.com> wrote:
> >>>>
> >>>>> I just tested the app with Mahout 0.6.
> >>>>> There seems to be a small performance improvement, but still
> >>>>> recommendations for the 'heavy users' take between 1-5 seconds.
> >>>>>
> >>>>>
> >>>>
> >>>
>
> --
> Manuel Blechschmidt
> Dortustr. 57
> 14467 Potsdam
> Mobil: 0173/6322621
> Twitter: http://twitter.com/Manuel_B
>
>
