mahout-user mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: Scalability of ParallelALSFactorizationJob with implicit feedback
Date Mon, 11 Jun 2012 19:28:08 GMT
Decomposition techniques do help with this, but it sounds like the OP was
using one of the cooccurrence-based techniques.

On Mon, Jun 11, 2012 at 12:19 PM, Sean Owen <srowen@gmail.com> wrote:

> Not so with ALS. The matrix in question is (# users) x (# features), so the
> number of rated items by any user won't matter.
>
> I didn't write this job, but implemented a similar pipeline. I struggled
> with this kind of tradeoff: loading the user feature matrix in memory is a
> scalability bottleneck (strains worker memory) but makes things go much,
> much faster.
>
> Computing U' * U is cake in memory. (Computing U * U' is not feasible here,
> but I don't think the job ever tries that. It shouldn't...)
>
> U is the biggest thing you need in memory at any given time. And it is
> going to need about 0.5KB per user at most. 10M users = 5GB RAM. Meh, that
> seems roughly "OK".
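[A sketch of the point above, not Mahout's actual code: the Gram matrix U' * U is only k x k (k = number of features), and it can be accumulated one user row at a time, so the full (# users) x k matrix never has to be materialized at once. The function name `gram_streaming` is illustrative.]

```python
def gram_streaming(rows, k):
    """Accumulate U' * U from an iterator of k-dimensional user rows.

    The result is k x k regardless of the number of users, which is
    why this product is cheap in memory even for many users.
    """
    g = [[0.0] * k for _ in range(k)]
    for row in rows:
        # Add this row's outer product row' * row into the accumulator.
        for i in range(k):
            for j in range(k):
                g[i][j] += row[i] * row[j]
    return g

# Two users, k = 3 features: the Gram matrix stays 3 x 3.
print(gram_streaming([[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]], 3))
```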
>
>
> If you like ALS on Hadoop, I don't mind again plugging the Myrrix
> Computation Layer (http://myrrix.com/documentation-computation-layer/), a
> sort of spin off of this kind of work I've been doing in Mahout (though not
> this class) that I've done a lot to optimize. I think it's about as swift
> as this will be on Hadoop -- and ALS does fit Hadoop quite well. Email
> off-list if you want to try it out.
>
>
> On Mon, Jun 11, 2012 at 8:08 PM, Ted Dunning <ted.dunning@gmail.com>
> wrote:
>
> > This sounds like you have a pathological user (or several) in your data
> > set.
> >
> > The cost of these jobs scales as the square of the activity of the most
> > active user.  This means that you typically need to eliminate this user
> (if
> > they are robots or QA) or down-sample them (if they are just crazy people
> > who download thousands and tens of thousands of things).  This generally
> > causes no perceptible impact on performance.
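[A minimal sketch of the down-sampling Ted describes, assuming a simple random cap per user; the cap value and function name are hypothetical, not anything in Mahout.]

```python
import random

def downsample_user(items, cap, seed=0):
    """Keep at most `cap` interactions for one user.

    Since job cost scales as the square of the most active user's
    activity, capping a robot or QA account at a few hundred items
    bounds the worst case with little effect on recommendation quality.
    """
    if len(items) <= cap:
        return list(items)
    rng = random.Random(seed)  # fixed seed for reproducible sampling
    return rng.sample(items, cap)

robot_history = list(range(10000))  # a pathological user
print(len(downsample_user(robot_history, 500)))
```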
> >
> > The system should easily scale to the size you need with a bit of care in
> > the data.
> >
> > On Mon, Jun 11, 2012 at 11:42 AM, Bill Mccormick <billmcc64@gmail.com
> > >wrote:
> >
> > > Hi all,
> > >
> > > We're interested in using Mahout for a recommendation system for a
> > largish
> > > online storefront.
> > >
> > > The initial recommendations are based on download/purchase history, so
> we
> > > were trying out the ParallelALSFactorizationJob which seems to give
> good
> > > results.
> > >
> > > The initial test run was limited to 100,000 users and the job ran with
> no
> > > problems.
> > >
> > > The next test set was structured differently with around 4M download
> > > records and around 1.5 M users (rather than a fixed number of users, it
> > was
> > > the set of downloads over a fixed period of time).   The Hadoop tasks
> > hung
> > > in garbage collection on this job.
> > >
> > > I started looking at memory usage, and I noticed that the existing
> > > implementation attempts to compute the product of the user factor
> matrix
> > > transpose with itself in memory.  (It also looks like it does this on
> > every
> > > mapper, instead of once per iteration.)
> > >
> > > Our full data set has on the order of 100M users.    So this isn't
> going
> > to
> > > work as is.  (i.e. the user factor matrix will take 100M users x 20
> > factors
> > > x 8 bytes per entry = 16 Gbytes)
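[The arithmetic above as a one-line sketch, using the same assumptions stated in the mail: 20 factors and 8 bytes (a double) per entry.]

```python
def factor_matrix_bytes(num_users, num_factors, bytes_per_entry=8):
    """Dense size of a (num_users x num_factors) matrix of doubles."""
    return num_users * num_factors * bytes_per_entry

# 100M users x 20 factors x 8 bytes = 16 GB, as in the mail.
print(factor_matrix_bytes(100_000_000, 20) / 1e9)
```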
> > >
> > > I'm just pondering implementing a new version that does the large
> matrix
> > > computations in a less memory-intensive fashion.   Before I go too
> far, I
> > > was hoping this list could provide some input:
> > >
> > > - is my analysis correct?
> > > - is someone already working on this?
> > > - if we go ahead with this, is the Mahout project interested in
> accepting
> > > the new implementation once it's done?
> > >
> > > thank you very much.
> > >
> > > --
> > > Bill McCormick
> > >
> >
>
