mahout-user mailing list archives

From Sean Owen <>
Subject Re: Scalability of ParallelALSFactorizationJob with implicit feedback
Date Wed, 20 Jun 2012 18:22:00 GMT
PS on this point, I've had great success locally by keeping the
user-feature or item-feature matrix out of memory, and on disk in the
form of a Hadoop MapFile, and then putting a large cache on top.
Caching works quite well since the number of items (users) that appear
frequently is small. You pay extra overhead for all the lookups in the
MapFile, but that can be helped somewhat by tuning its indexing
factor. In the end I have found you save a surprising amount of time
vs loading the whole thing just because of the frequency distribution
-- many users/items, even most, need never be touched by a reducer.
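The caching scheme described above can be sketched as a plain LRU cache in front of a slow on-disk lookup. This is a hypothetical illustration, not the Mahout code: `CachedFeatureLookup`, `CACHE_SIZE`, and `readFromDisk` are invented names, and `readFromDisk` stands in for a `MapFile.Reader` lookup.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: an LRU cache in front of an on-disk feature lookup
// (standing in for a Hadoop MapFile.Reader). Names and sizes are illustrative.
public class CachedFeatureLookup {

  private static final int CACHE_SIZE = 100_000;

  // A LinkedHashMap in access order evicts the least-recently-used entry
  // once the map grows past CACHE_SIZE.
  private final Map<Long, float[]> cache =
      new LinkedHashMap<Long, float[]>(CACHE_SIZE, 0.75f, true) {
        @Override
        protected boolean removeEldestEntry(Map.Entry<Long, float[]> eldest) {
          return size() > CACHE_SIZE;
        }
      };

  public float[] getFeatures(long id) {
    float[] features = cache.get(id);
    if (features == null) {
      features = readFromDisk(id);  // e.g. MapFile.Reader.get(...)
      cache.put(id, features);
    }
    return features;
  }

  // Placeholder for the actual on-disk lookup.
  private float[] readFromDisk(long id) {
    return new float[20];
  }
}
```

Because popular users/items dominate the lookups, most requests hit the cache and the long tail stays on disk.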

On Mon, Jun 11, 2012 at 8:19 PM, Sean Owen <> wrote:
> If you like ALS on Hadoop, I don't mind again plugging the Myrrix
> Computation Layer, a sort of spin-off of this kind of work I've been doing
> in Mahout (though not
> this class) that I've done a lot to optimize. I think it's about as swift as
> this will be on Hadoop -- and ALS does fit Hadoop quite well. Email off-list
> if you want to try it out.
>> > I started looking at memory usage, and I noticed that the existing
>> > implementation attempts to compute the product of the user factor matrix
>> > transpose with itself in memory.  (It also looks like it does this on
>> > every
>> > mapper, instead of once per iteration.)
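For what it's worth, the transpose product Y^T Y is only k x k (20 x 20 here), so it can be accumulated by streaming one row of the factor matrix at a time, in O(k^2) memory, rather than holding the whole matrix. A hypothetical sketch (class and method names are invented for illustration):

```java
// Sketch: accumulate the k x k Gram matrix Y^T Y by streaming rows of Y,
// so memory is O(k^2) regardless of how many rows (users) Y has.
public class GramAccumulator {

  private final int k;
  private final double[][] yty;

  public GramAccumulator(int k) {
    this.k = k;
    this.yty = new double[k][k];
  }

  // Add one row y_i of Y: YtY += y_i * y_i^T
  public void addRow(double[] row) {
    for (int a = 0; a < k; a++) {
      for (int b = 0; b < k; b++) {
        yty[a][b] += row[a] * row[b];
      }
    }
  }

  public double[][] result() {
    return yty;
  }
}
```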
>> >
>> > Our full data set has on the order of 100M users, so this isn't going to
>> > work as is (i.e. the user factor matrix will take 100M users x 20 factors
>> > x 8 bytes per entry = 16 Gbytes).
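That estimate checks out; as a quick sketch (the helper name is hypothetical):

```java
// Back-of-the-envelope memory estimate for a dense factor matrix
// stored as doubles: rows x factors x 8 bytes per entry.
public class MemoryEstimate {
  static long factorMatrixBytes(long rows, int factors, int bytesPerEntry) {
    return rows * factors * bytesPerEntry;
  }
}
```

For 100M users and 20 factors that is 16,000,000,000 bytes, i.e. 16 GB.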
>> >
>> > I'm just pondering implementing a new version that does the large matrix
>> > computations in a less memory-intensive fashion. Before I go too far, I
>> > was hoping this list could provide some input:
