mahout-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Dmitriy Lyubimov <>
Subject Re: is it possible to compute the SVD for a large scale matrix
Date Wed, 06 Apr 2011 18:26:11 GMT
so, assuming 500 oversampled svalues is equivalent to perhaps 300
'good' values.... depending on decay... so 300 singular values would
require 300 passes over the whole input? or only sub-part of it?
Given it takes about 20 s just to set up a MR run and 10 sec to
confirm it's completion, that's just what... about 100-150 minutes
just in initialization time?

Also, the size of the problem must also affect sorting i/o time
(unless all jobs are map-only, but i don't think they can be). That's
kind of at least proportional to the size of the input. so I guess
problem size does matter, not just the # of available slots for the

On Wed, Apr 6, 2011 at 11:16 AM, Jake Mannix <> wrote:
> Hmmm... that's a really tiny data set.  Lanczos-based SVD, for k singular
> values, requires k passes over the data, and each row which has d non-zero
> entries will do d^2 computations in each pass.  So if there are n rows in
> the
> data set, it's k*n*d^2 if all rows are the same size.
> I guess "how long" depends on how big the cluster is!
> On Wed, Apr 6, 2011 at 11:14 AM, Dmitriy Lyubimov <> wrote:
>> Jake, since we are on the topic, what's the running times of Lanczos
>> on a ~1G worth sequence file input might be?
>> On Wed, Apr 6, 2011 at 11:11 AM, Jake Mannix <>
>> wrote:
>> >
>> >
>> > On Thu, Mar 24, 2011 at 11:03 PM, Dmitriy Lyubimov <>
>> > wrote:
>> >>
>> >> you can certainly try to write it out into a DRM (distributed row
>> >> matrix) and run stochastic SVD on  hadoop (off the trunk now). see
>> >> MAHOUT-593. This is suitable if you have a good decay of singular
>> >> values (but if you don't it probably just means you have so much noise
>> >> that it masks the problem you are trying to solve in your data).
>> >
>> > You don't need to run it as stochastic, either.  The regular
>> > LanczosSolver
>> > will work on this data, if it lives as a DRM.
>> >
>> >   -jake

View raw message