mahout-user mailing list archives

From Dmitriy Lyubimov <dlie...@gmail.com>
Subject Re: SSVD too slow to handle large matrix?
Date Fri, 14 Sep 2012 21:36:31 GMT
Also, you can compare your performance experiments with Nathan Halko's
here: http://amath.colorado.edu/faculty/martinss/Pubs/2012_halko_dissertation.pdf
pp. 110+...

They attempted very large problems, as many as 726 splits of 512mb each
with -k 100. (The default split size is what... 64mb?) They had a problem
tuning the ABt job (as expected -- it looks like they had too much
memory starvation and GC thrashing to run it efficiently), but
even there I am not quite sure whether that was before the performance
patches for the ABt job. It looks like that problem took them almost a day
to run through with -q1 -- and again, that was mostly because of the ABt
multiplication. Extremely sparse inputs will cause more trouble for ABt,
whereas denser inputs are less prone to problems with q>0.
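
For a sense of scale, here is a rough back-of-the-envelope sketch of that
largest run. It assumes "512mb" means 512 MiB, that every split is full,
and that the default split size really is 64mb (the message above only
guesses at it), so treat the numbers as an upper-bound estimate:

```python
# Rough size of the largest run mentioned above: 726 splits of 512 MB each.
# Assumptions (not from the dissertation): every split is full, and the
# default split size is 64 MB, as guessed in the message.
SPLITS = 726
SPLIT_MB = 512
DEFAULT_SPLIT_MB = 64

total_mb = SPLITS * SPLIT_MB          # total input in MB
total_gb = total_mb / 1024            # same figure in GB
equivalent_default_splits = total_mb // DEFAULT_SPLIT_MB

print(f"total input: {total_mb} MB (~{total_gb:.0f} GB)")
print(f"at {DEFAULT_SPLIT_MB} MB per split: {equivalent_default_splits} splits")
```

So that run was on the order of a few hundred GB of input, which at the
assumed default split size would mean several thousand map tasks.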

-d

On Fri, Sep 14, 2012 at 2:23 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
> Most importantly, what is your number of non-zero elements (or the input
> sequence file size)?
>
> On Fri, Sep 14, 2012 at 2:19 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
>> The Q job is actually the fastest one, and it is map-only. I'd say you
>> should drop all the optional parameters (including p) and use mahout 0.7.
>>
>> Actually reducing split size is unlikely to help. Default split should be fine.
>>
>> I'd say running -k 10 on input of any size should result in a Q mapper
>> task running for at most a couple of minutes.
>>
>> Using -k200 -p100 is fairly ambitious (mapper task running time will
>> scale a little worse than proportionally to k+p).
>>
>> If you use -q1 you will likely have more problems with the ABt job, and
>> that may require some memory tuning...
>>
>> Otherwise check the usual things -- memory, cluster capacity (do you
>> actually have the capacity to run 100 mappers? Do they have at least 1G
>> of RAM on -Xmx without touching the swap? Are you seeing GC
>> thrashing? etc.)
>>
>> That said, your problem doesn't seem too big (judging from 100 mappers
>> with a regular split size, that should be ok). With -k 100 and the default
>> p you should expect a single Q task to run for about 20-25 minutes,
>> depending on your hardware. It is cpu-bound (or rather, mostly
>> fpu-bound, assuming you have tackled memory issues etc.)
>>
>>
>> On Fri, Sep 14, 2012 at 1:24 PM, lei tang <find.ltang@gmail.com> wrote:
>>> Hi,
>>>
>>> I am using mahout's SSVD (stochastic SVD) to factorize a huge sparse
>>> matrix (around 30M x 1M). I used a modified version of the script from
>>> http://bickson.blogspot.com/2011/02/mahout-svd-matrix-factorization.html
>>> to store the input matrix as <key, value> pairs, with an integer key and
>>> a vectorwritable value (in particular, SequentialAccessSparseVector).
>>> Should I change to RandomAccessSparseVector?
>>>
>>> I managed to run mahout SSVD with the following specification.
>>> mahout ssvd -Dmapred.max.split.size=1000000 -i mf/tr_full.seq -o
>>> mf/out_full -k 200 -p 100 -r 100000 -U true -V true -t 20 --tempDir mf/tmp
>>>
>>> I specified the max split size in order to have more mappers. However,
>>> the first Q job does not seem to be moving. After 1 hour, it is still at
>>> 12% with 100 mappers. Is this expected? Should I change any parameter?
>>>
>>> Any suggestion is highly appreciated.
>>>
>>> - Lei
>>> P.S.  I'm also reading the docs from
>>> https://issues.apache.org/jira/browse/MAHOUT-376  in hope that I can figure
>>> out why it is so slow.
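
For the -k 200 -p 100 run in the original question, the estimate given in
the reply (20-25 minutes per Q task with -k 100 and the default p) can be
extrapolated as a rough sketch. Two assumptions here are not from the
thread: that the default p is 15 (check this against your mahout version),
and that time scales exactly linearly in k+p. The reply says scaling is a
little worse than proportional, so these figures are closer to a lower
bound than a prediction:

```python
def scaled_q_task_minutes(k, p, ref_minutes, ref_k=100, ref_p=15):
    """Linearly scale a reference Q mapper-task time by (k + p).

    ref_k/ref_p describe the reference run; ref_p=15 is an assumed
    default for -p, not a value stated in the thread.
    """
    return ref_minutes * (k + p) / (ref_k + ref_p)

# Reference estimate from the reply: 20-25 minutes at -k 100, default p.
low = scaled_q_task_minutes(200, 100, 20)
high = scaled_q_task_minutes(200, 100, 25)

print(f"-k 200 -p 100: at least ~{low:.0f}-{high:.0f} minutes per Q task")
```

Under these assumptions each Q mapper task would need roughly an hour on
comparable hardware, before counting any memory or GC trouble.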
