mahout-user mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: Scalability of ParallelALSFactorizationJob with implicit feedback
Date Mon, 11 Jun 2012 19:08:00 GMT
This sounds like you have a pathological user (or several) in your data set.

The cost of these jobs scales as the square of the activity of the most
active user.  This means that you typically need to eliminate such users (if
they are robots or QA accounts) or down-sample them (if they are just crazy
people who download thousands or tens of thousands of things).  This
generally has no perceptible impact on the quality of the recommendations.
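
As a rough illustration of the down-sampling described above, here is a
minimal Python sketch. It is not Mahout code: the function name
downsample_heavy_users, the (user, item) tuple format and the max_per_user
cap are all assumptions made for the example, and a real pipeline would
probably do this as a preprocessing step in Hadoop before running the job.

```python
import random
from collections import defaultdict


def downsample_heavy_users(interactions, max_per_user, seed=0):
    """Cap the number of interactions kept for each user.

    interactions: iterable of (user, item) pairs (hypothetical input format).
    max_per_user: assumed cap; users at or below it are left untouched.
    """
    rng = random.Random(seed)  # seeded so repeated runs give the same sample

    # Group items by user so each user's activity can be measured.
    by_user = defaultdict(list)
    for user, item in interactions:
        by_user[user].append(item)

    kept = []
    for user, items in by_user.items():
        if len(items) > max_per_user:
            # Uniform random sample without replacement for heavy users.
            items = rng.sample(items, max_per_user)
        kept.extend((user, item) for item in items)
    return kept
```

With a cap of 50, a robot account with 1,000 downloads would contribute 50
records, while an ordinary user with two downloads keeps both. Users that
are known robots or QA accounts can simply be filtered out instead.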

The system should easily scale to the size you need with a bit of care in
the data.

On Mon, Jun 11, 2012 at 11:42 AM, Bill Mccormick <billmcc64@gmail.com> wrote:

> Hi all,
>
> We're interested in using Mahout for a recommendation system for a largish
> online storefront.
>
> The initial recommendations are based on download/purchase history, so we
> were trying out the ParallelALSFactorizationJob which seems to give good
> results.
>
> The initial test run was limited to 100,000 users and the job ran with no
> problems.
>
> The next test set was structured differently with around 4M download
> records and around 1.5 M users (rather than a fixed number of users, it was
> the set of downloads over a fixed period of time).   The Hadoop tasks hung
> in garbage collection on this job.
>
> I started looking at memory usage, and I noticed that the existing
> implementation attempts to compute the product of the user factor matrix
> transpose with itself in memory.  (It also looks like it does this on every
> mapper, instead of once per iteration.)
>
> Our full data set has on the order of 100M users.    So this isn't going to
> work as is.  (i.e. the user factor matrix will take 100M users x 20 factors
> x 8 bytes per entry = 16 Gbytes)
>
> I'm just pondering implementing a new version that does the large matrix
> computations in a less memory intensive fashion.   Before I go too far, I
> was hoping this list could provide some input:
>
> - is my analysis correct?
> - is someone already working on this?
> - if we go ahead with this, is the Mahout project interested in accepting
> the new implementation once it's done?
>
> thank you very much.
>
> --
> Bill McCormick
>
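
To make the memory arithmetic in the quoted message concrete, here is a
small Python sketch. It is not how ParallelALSFactorizationJob is
implemented; the function names are hypothetical. It reproduces the 16 GB
estimate for the full factor matrix and shows that the product of the
factor matrix transpose with itself is only num_factors x num_factors, so
it can be accumulated one row at a time without holding the whole matrix
in memory.

```python
def factor_matrix_bytes(num_users, num_factors, bytes_per_entry=8):
    """Size of a dense num_users x num_factors matrix of 8-byte doubles."""
    return num_users * num_factors * bytes_per_entry


def streaming_gram(rows, num_factors):
    """Accumulate Y^T Y as a sum of outer products, one row of Y at a time.

    rows: any iterable of length-num_factors sequences (for example, a
    generator reading factor vectors from disk). Only the small
    num_factors x num_factors result is kept in memory.
    """
    gram = [[0.0] * num_factors for _ in range(num_factors)]
    for row in rows:
        for i in range(num_factors):
            ri = row[i]
            for j in range(num_factors):
                gram[i][j] += ri * row[j]
    return gram
```

For 100M users and 20 factors, factor_matrix_bytes gives 16,000,000,000
bytes, while the 20 x 20 result of streaming_gram needs only 3,200 bytes
of doubles.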
