I don't think anybody has done anything on quite that scale, though Jake may
have come relatively close.
There are several scaling limits, including:

- The total number of nonzero elements. This drives the scan time and, to
some extent, the cost of the multiplies.
- The total number of singular vectors desired. This directly drives the
number of iterations in the Hebbian approach and the size of intermediate
products in the random projection techniques. It also multiplies with the
next factor.
- The number of columns in the original matrix. This, multiplied by the
number of singular vectors, drives the memory cost of some approaches in
the final step, or in the SVD step for the random projection.
- The number of rows in the original matrix. This is a secondary factor
that can drive some intermediate products in the random projection.
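To make the random projection case concrete, here is a minimal numpy sketch of a randomized-projection SVD with comments marking where each of the factors above shows up. This is a toy illustration, not Mahout's implementation; the sizes and variable names (n_rows, n_cols, k, the oversampling of 5) are mine.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows, n_cols, k = 500, 200, 10          # rows, columns, singular vectors wanted

# Toy low-rank input so the approximation is exact up to roundoff.
A = rng.standard_normal((n_rows, k)) @ rng.standard_normal((k, n_cols))

# 1. Random projection: one pass over A, cost proportional to nnz(A) * k.
omega = rng.standard_normal((n_cols, k + 5))   # small oversampling
Y = A @ omega                                  # intermediate: n_rows x (k+5)

# 2. Orthonormalize the sample; this intermediate scales with rows * vectors.
Q, _ = np.linalg.qr(Y)

# 3. Project down and take a small dense SVD. B is (k+5) x n_cols, so this
#    step's memory scales with columns * singular vectors.
B = Q.T @ A
Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
U = Q @ Ub                                     # left singular vectors of A

# Relative reconstruction error of the projected factorization.
err = np.linalg.norm(A - (U * s) @ Vt) / np.linalg.norm(A)
```

At 40M columns and, say, a few hundred singular vectors, step 3's columns-times-vectors term alone is on the order of 10^10 entries, which is why that factor tends to dominate at this scale.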
Which of these will hang you up in your problem is an open question. There
is always the factor I haven't thought about yet.
Jake, do you have any thoughts on this?
On Sat, Sep 4, 2010 at 5:08 PM, Akshay Bhat <akshayubhat@gmail.com> wrote:
> Hello,
> Has anyone attempted SVD of a really large matrix (~40 million rows
> and columns, to be specific) using Mahout?
> I am planning to perform SVD using Mahout on the Twitter follower network (it
> contains information about ~35 million users following ~45 million users,
> http://an.kaist.ac.kr/traces/WWW2010.html ) and I should have access to the
> Cornell Hadoop cluster (55 quad-core nodes with 1618GB RAM per node). Can
> anyone estimate how long the job will run?
> Also, is it possible to perform regularized SVD, or will I need to add that
> functionality by modifying the code?
> Thank you
>
> Akshay Uday Bhat.
> Graduate Student, Computer Science, Cornell University
> Website: http://www.akshaybhat.com
>
