Yes. It would take a long time.
But, on the other side of the discussion, it is unlikely that any singular
vectors that you get past 2050 (depends on the problem)
will be anything but elaborate encodings of noise anyway. For lots of
problems, a very small number of real singular vectors plus
a bunch of random numbers will suffice just as well. So I wouldn't expect
more than 50 passes would ever be needed.
Lots of people have studied the problem of how performance improves with
larger numbers of reduced dimension, but few have
studied the problem properly by looking at the tradeoff singular vectors
versus random vectors. My guess is that most systems
would work well with no more than 10 singular vectors plus 3040 random
projections.
On Wed, Apr 6, 2011 at 11:26 AM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
> so, assuming 500 oversampled svalues is equivalent to perhaps 300
> 'good' values.... depending on decay... so 300 singular values would
> require 300 passes over the whole input? or only subpart of it?
> Given it takes about 20 s just to set up a MR run and 10 sec to
> confirm it's completion, that's just what... about 100150 minutes
> just in initialization time?
>
> Also, the size of the problem must also affect sorting i/o time
> (unless all jobs are maponly, but i don't think they can be). That's
> kind of at least proportional to the size of the input. so I guess
> problem size does matter, not just the # of available slots for the
> mappers.
>
>
> On Wed, Apr 6, 2011 at 11:16 AM, Jake Mannix <jake.mannix@gmail.com>
> wrote:
> > Hmmm... that's a really tiny data set. Lanczosbased SVD, for k singular
> > values, requires k passes over the data, and each row which has d
> nonzero
> > entries will do d^2 computations in each pass. So if there are n rows in
> > the
> > data set, it's k*n*d^2 if all rows are the same size.
> > I guess "how long" depends on how big the cluster is!
> >
> > On Wed, Apr 6, 2011 at 11:14 AM, Dmitriy Lyubimov <dlieu.7@gmail.com>
> wrote:
> >>
> >> Jake, since we are on the topic, what's the running times of Lanczos
> >> on a ~1G worth sequence file input might be?
> >>
> >> On Wed, Apr 6, 2011 at 11:11 AM, Jake Mannix <jake.mannix@gmail.com>
> >> wrote:
> >> >
> >> >
> >> > On Thu, Mar 24, 2011 at 11:03 PM, Dmitriy Lyubimov <dlieu.7@gmail.com
> >
> >> > wrote:
> >> >>
> >> >> you can certainly try to write it out into a DRM (distributed row
> >> >> matrix) and run stochastic SVD on hadoop (off the trunk now). see
> >> >> MAHOUT593. This is suitable if you have a good decay of singular
> >> >> values (but if you don't it probably just means you have so much
> noise
> >> >> that it masks the problem you are trying to solve in your data).
> >> >
> >> > You don't need to run it as stochastic, either. The regular
> >> > LanczosSolver
> >> > will work on this data, if it lives as a DRM.
> >> >
> >> > jake
> >
> >
>
