So, assuming 500 oversampled singular values are equivalent to perhaps 300
'good' values, depending on decay, would 300 singular values
require 300 passes over the whole input, or only over a subpart of it?
Given that it takes about 20 s just to set up an MR run and 10 s to
confirm its completion, that's what, about 100-150 minutes
just in initialization time?
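
A rough sketch of that overhead arithmetic, assuming one MapReduce job per
pass (the question above, not a confirmed fact) and using the 20 s / 10 s
figures from this message. The function name and parameters are hypothetical.

```python
def mr_overhead_minutes(passes, setup_s=20.0, confirm_s=10.0):
    """Fixed per-job overhead in minutes for `passes` sequential MR jobs.

    Assumes one MapReduce job per pass; setup_s and confirm_s are the
    rough per-job costs quoted in this thread, not measured values.
    """
    return passes * (setup_s + confirm_s) / 60.0


# Upper end: setup plus completion check on every pass.
print(mr_overhead_minutes(300))                 # 150.0
# Lower end: setup cost only.
print(mr_overhead_minutes(300, confirm_s=0.0))  # 100.0
```
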
Also, the size of the problem must affect sorting I/O time
(unless all jobs are map-only, but I don't think they can be). That
is at least proportional to the size of the input. So I guess
problem size does matter, not just the number of available slots for the
mappers.
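
For the compute side, Jake's k*n*d^2 estimate quoted below can be plugged in
directly. This is only a back-of-the-envelope sketch: the function name is
hypothetical, and the sample n and d are made-up illustrative values, not
properties of any data set discussed here.

```python
def lanczos_ops(k, n, d):
    """Rough operation count: k passes, n rows, d nonzero entries per row.

    Assumes every row has the same number of nonzeros, as in the quoted
    estimate.
    """
    return k * n * d ** 2


# Illustrative numbers only: 300 passes, 1M rows, 100 nonzeros per row.
ops = lanczos_ops(k=300, n=1_000_000, d=100)
print(f"{ops:.1e}")  # 3.0e+12
```
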
On Wed, Apr 6, 2011 at 11:16 AM, Jake Mannix <jake.mannix@gmail.com> wrote:
> Hmmm... that's a really tiny data set. Lanczos-based SVD, for k singular
> values, requires k passes over the data, and each row which has d nonzero
> entries will do d^2 computations in each pass. So if there are n rows in
> the data set, it's k*n*d^2 if all rows are the same size.
> I guess "how long" depends on how big the cluster is!
>
> On Wed, Apr 6, 2011 at 11:14 AM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
>>
>> Jake, since we are on the topic, what might the running time of Lanczos
>> be on a sequence file input worth ~1G?
>>
>> On Wed, Apr 6, 2011 at 11:11 AM, Jake Mannix <jake.mannix@gmail.com>
>> wrote:
>> >
>> >
>> > On Thu, Mar 24, 2011 at 11:03 PM, Dmitriy Lyubimov <dlieu.7@gmail.com>
>> > wrote:
>> >>
>> >> You can certainly try to write it out into a DRM (distributed row
>> >> matrix) and run stochastic SVD on Hadoop (off the trunk now). See
>> >> MAHOUT-593. This is suitable if you have a good decay of singular
>> >> values (but if you don't, it probably just means you have so much noise
>> >> that it masks the problem you are trying to solve in your data).
>> >
>> > You don't need to run it as stochastic, either. The regular LanczosSolver
>> > will work on this data, if it lives as a DRM.
>> >
>> > jake
>
>
