mahout-user mailing list archives

From Dmitriy Lyubimov <>
Subject Re: SSVD too slow to handle large matrix?
Date Fri, 14 Sep 2012 21:23:08 GMT
Most importantly, what is the number of non-zero elements in your input
(or the size of the input sequence file)?

On Fri, Sep 14, 2012 at 2:19 PM, Dmitriy Lyubimov <> wrote:
> The Q job is actually the fastest one, and it is map-only. I'd say drop all the
> optional parameters (including p) and use Mahout 0.7.
> Actually reducing split size is unlikely to help. Default split should be fine.
> I'd say running -k 10 on input of any size should result in a Q mapper
> task running for at most a couple of minutes.
> Using -k 200 -p 100 is fairly ambitious (mapper task running time will
> scale a little worse than proportionally to k+p).
> If you use -q 1 you are likely to have more problems with the ABt job, and
> that may require some memory tuning...
> Otherwise check the usual things: memory and cluster capacity (do you
> actually have the capacity to run 100 mappers? Do they have at least 1G
> of RAM on -Xmx without scratching the swap? Are you seeing GC
> thrashing? etc.)
> That said, your problem doesn't seem too big (judging from 100 mappers
> with a regular split size, that should be OK). With -k 100 and the default
> p you should expect a single Q task to run for about 20-25 minutes,
> depending on your hardware. It is CPU-bound (or rather, mostly
> FPU-bound, assuming you have tackled memory issues etc.).
> On Fri, Sep 14, 2012 at 1:24 PM, lei tang <> wrote:
>> Hi,
>> I am using Mahout's SSVD (stochastic SVD) to factorize a huge sparse
>> matrix (around 30M x 1M). I used a modified script of
>> to store the input matrix with <key, value> pairs being an integer and a
>> VectorWritable (in particular, SequentialAccessSparseVector). Should I
>> change to RandomAccessSparseVector?
>> I managed to run mahout SSVD with the following specification.
>> mahout ssvd -Dmapred.max.split.size=1000000 -i mf/tr_full.seq -o
>> mf/out_full -k 200 -p 100 -r 100000 -U true -V true -t 20 --tempDir mf/tmp
>> I specified the max split size in order to have more mappers. However, the
>> first Q job does not seem to be moving. After 1 hour, it is still at 12% with 100
>> mappers. Is this expected? Should I change any parameter?
>> Any suggestion is highly appreciated.
>> - Lei
>> P.S.  I'm also reading the docs from
>>  in hope that I can figure
>> out why it is so slow.
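
As a rough illustration of the scaling remark in the reply above, the sketch below extrapolates the quoted figure of 20-25 minutes per Q task (at -k 100 with the default p) to -k 200 -p 100, treating running time as exactly proportional to k+p. Since the reply says scaling is a little worse than proportional, the result should be read as a lower bound only. The default p of 15 and the function name q_task_minutes are assumptions made for illustration, and actual times depend on hardware and memory settings.

```python
def q_task_minutes(k, p, ref_minutes, ref_k=100, ref_p=15):
    """Linear extrapolation of Q mapper task time from a reference run.

    ref_minutes is the observed (or quoted) time at ref_k + ref_p.
    ref_p=15 is an assumed default oversampling value.
    Real scaling is reported to be somewhat worse than linear in k+p,
    so this is a lower bound rather than a prediction.
    """
    return ref_minutes * (k + p) / (ref_k + ref_p)


# Quoted reference: 20-25 minutes per Q task at -k 100, default p.
low = q_task_minutes(200, 100, 20.0)
high = q_task_minutes(200, 100, 25.0)
print(f"-k 200 -p 100: at least {low:.0f}-{high:.0f} minutes per Q task")
```

Under these assumptions, -k 200 -p 100 works out to roughly 52-65 minutes or more per Q mapper task, so slow progress after one hour would not by itself point to a stuck job.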
