mahout-user mailing list archives

From Dmitriy Lyubimov <dlie...@gmail.com>
Subject Re: Mahout SSVD is too slow for highly dimensional data
Date Mon, 10 Jun 2013 18:42:30 GMT
On Mon, Jun 10, 2013 at 11:28 AM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:

> What is the requested rank? This guy will not scale w.r.t. rank, only w.r.t.
> input size. Realistically you don't need k > 100, p > 15.
>
> What is the input size (A, in GB)?
>
>
> On Mon, Jun 10, 2013 at 5:31 AM, Yahia Zakaria <yahiawestlife@gmail.com> wrote:
>
>> Hi All
>>
>> I am running Mahout SSVD (trunk version) with the pca option on the Bag of
>> Words dataset (http://archive.ics.uci.edu/ml/datasets/Bag+of+Words). This
>> dataset has 8000000 instances (rows) and 100000 attributes (columns).
>> Mahout SSVD is too slow: it may take days to finish the first phase of
>> SSVD (Q-Job). I am running the code on a cluster of 16 machines, each with
>> 8 cores and 32 GB of memory. Moreover, the CPU and memory of the workers
>> are not utilized at all.
>
>
Also: this is suspicious. It is a CPU-bound job (memory requirements are
quite modest, though).

If your data are extremely sparse, and/or your Hadoop input splits are large
enough that a map task receives more rows than specified by -r (default
30,000), then it spills Q blocks to disk for the second pass. That may be
more data than the input itself if the requested k is greater than the
average number of non-zero elements per row. If you have enough memory, just
bump up -r (or use smaller Hadoop splits).

But the single most important thing is still (k+p). The CPU flops scale at
about O((k+p)^1.5). Since Hadoop splits linearly with input size, it is not
possible to split w.r.t. the flop increase driven by (k+p) without additional
custom splitting tricks. Don't use (k+p) > 100. Seriously. Especially for
LSA. Whatever you do, your LSA input will already be sufficiently varying
w.r.t. general human knowledge about concepts in it, so approximate
inference is quite sufficient here IMO.
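
To get a feel for what the O((k+p)^1.5) estimate implies, here is a minimal Python sketch that compares per-task CPU cost against a baseline of k+p = 100. The function name and the choice of baseline are illustrative assumptions; the exponent is only the approximate figure quoted above, so treat the output as a rough guide, not a benchmark.

```python
def relative_flops(k, p, base_k_plus_p=100):
    """CPU cost relative to the baseline, assuming flops ~ (k+p)^1.5."""
    return ((k + p) / base_k_plus_p) ** 1.5


if __name__ == "__main__":
    # Hypothetical (k, p) settings chosen so that k+p is 100, 200 and 400.
    for k, p in [(85, 15), (185, 15), (385, 15)]:
        print(f"k={k} p={p} k+p={k + p} relative cost={relative_flops(k, p):.2f}")
```

Under this assumption, quadrupling k+p from 100 to 400 makes each map task roughly 8 times more expensive, while the number of map tasks stays the same because it depends only on input size.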



>> While running Mahout SSVD on a smaller dataset (12500 rows and 5000
>> columns), it runs very fast; the job finished in 2 minutes. Do you have
>> any idea why Mahout SSVD is so slow for high-dimensional data, and to
>> what extent SSVD can work efficiently (with respect to the number of
>> rows and columns of the input matrix)?
>>
>> Thanks
>> Yehia
>>
>
>
