SSVD for 50 values actually would take significantly less time that
for 500. actually ~500^2/50^2 times faster, i think, as the flops are
in QR computation for most part. I did not try to run it with such
small k+p values though -- not on bigger inputs anyway.
On Wed, Apr 6, 2011 at 11:39 AM, Ted Dunning wrote:
> Yes. It would take a long time.
> But, on the other side of the discussion, it is unlikely that any singular
> vectors that you get past 20-50 (depends on the problem)
> will be anything but elaborate encodings of noise anyway. For lots of
> problems, a very small number of real singular vectors plus
> a bunch of random numbers will suffice just as well. So I wouldn't expect
> more than 50 passes would ever be needed.
> Lots of people have studied the problem of how performance improves with
> larger numbers of reduced dimension, but few have
> studied the problem properly by looking at the trade-off singular vectors
> versus random vectors. My guess is that most systems
> would work well with no more than 10 singular vectors plus 30-40 random
> projections.
>
> On Wed, Apr 6, 2011 at 11:26 AM, Dmitriy Lyubimov wrote:
>>
>> so, assuming 500 oversampled svalues is equivalent to perhaps 300
>> 'good' values.... depending on decay... so 300 singular values would
>> require 300 passes over the whole input? or only sub-part of it?
>> Given it takes about 20 s just to set up a MR run and 10 sec to
>> confirm it's completion, that's just what... about 100-150 minutes
>> just in initialization time?
>>
>> Also, the size of the problem must also affect sorting i/o time
>> (unless all jobs are map-only, but i don't think they can be). That's
>> kind of at least proportional to the size of the input. so I guess
>> problem size does matter, not just the # of available slots for the
>> mappers.
>>
>>
>> On Wed, Apr 6, 2011 at 11:16 AM, Jake Mannix
>> wrote:
>> > Hmmm... that's a really tiny data set. Lanczos-based SVD, for k
>> > singular
>> > values, requires k passes over the data, and each row which has d
>> > non-zero
>> > entries will do d^2 computations in each pass. So if there are n rows
>> > in
>> > the
>> > data set, it's k*n*d^2 if all rows are the same size.
>> > I guess "how long" depends on how big the cluster is!
>> >
>> > On Wed, Apr 6, 2011 at 11:14 AM, Dmitriy Lyubimov
>> > wrote:
>> >>
>> >> Jake, since we are on the topic, what's the running times of Lanczos
>> >> on a ~1G worth sequence file input might be?
>> >>
>> >> On Wed, Apr 6, 2011 at 11:11 AM, Jake Mannix
>> >> wrote:
>> >> >
>> >> >
>> >> > On Thu, Mar 24, 2011 at 11:03 PM, Dmitriy Lyubimov
>> >> >
>> >> > wrote:
>> >> >>
>> >> >> you can certainly try to write it out into a DRM (distributed row
>> >> >> matrix) and run stochastic SVD on hadoop (off the trunk now). see
>> >> >> MAHOUT-593. This is suitable if you have a good decay of singular
>> >> >> values (but if you don't it probably just means you have so much
>> >> >> noise
>> >> >> that it masks the problem you are trying to solve in your data).
>> >> >
>> >> > You don't need to run it as stochastic, either. The regular
>> >> > LanczosSolver
>> >> > will work on this data, if it lives as a DRM.
>> >> >
>> >> > -jake
>> >
>> >
>
>