mahout-user mailing list archives

From Akshay Bhat <akshayub...@gmail.com>
Subject Re: Regarding the scalability of SVD code in Mahout
Date Wed, 08 Sep 2010 01:16:46 GMT
Thanks Ted,

On Sun, Sep 5, 2010 at 2:05 AM, Ted Dunning <ted.dunning@gmail.com> wrote:

> I don't think anybody has done anything on quite that scale, though Jake
> may have come relatively close.
>
> There are several scaling limits.  These include:
>
> - the total number of non-zero elements.  This drives the scan time and, to
> some extent the cost of the multiplies.
>
The total number of non-zero elements is small, since most Twitter users
follow around 100 other users on average.
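
As a rough back-of-envelope check of that claim, the sketch below multiplies the row count by the average number of followed users and derives the matrix density. The figures (~35 million rows, ~47 million columns, ~100 follows per user) are the approximate numbers quoted in this thread; the function names are hypothetical and the result is only an estimate, not a measurement of the data set.

```python
def estimate_nonzeros(num_rows: int, avg_nonzeros_per_row: int) -> int:
    """Estimated non-zero count: one entry per (follower, followed) pair."""
    return num_rows * avg_nonzeros_per_row


def density(nnz: int, num_rows: int, num_cols: int) -> float:
    """Fraction of matrix cells that are non-zero."""
    return nnz / (num_rows * num_cols)


# Approximate figures taken from the thread (assumptions, not measured values).
rows = 35_000_000   # users who follow someone
cols = 47_000_000   # users who are followed
avg_follows = 100   # average number of users each row follows

nnz = estimate_nonzeros(rows, avg_follows)
print(f"estimated non-zeros: {nnz:.2e}")
print(f"density: {density(nnz, rows, cols):.2e}")
```

Under these assumptions the matrix is extremely sparse relative to its dimensions, although the absolute number of non-zeros still determines the scan time per pass.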


> - the total number of singular vectors desired.  This directly drives the
> number of iterations in the Hebbian approach and drives the size of
> intermediate products in the random projection techniques.  It also causes
> product scaling with the next factor.
>
I plan to calculate around 50-200 singular vectors.

>
> - the number of columns in the original matrix.  This, multiplied by the
> number of singular vectors drives the memory cost of some approaches in the
> final step or in the SVD step for the random projection.
>
The number of columns in the matrix is ~47 million.
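
To put the "columns multiplied by singular vectors" memory cost in concrete terms, here is a minimal sketch that assumes the result is held as one dense array of 8-byte doubles. That storage layout is an assumption made for illustration, not a statement about how Mahout stores it, and the function name is hypothetical.

```python
BYTES_PER_DOUBLE = 8  # assumption: dense storage in double precision


def dense_factor_bytes(num_cols: int, num_vectors: int,
                       bytes_per_value: int = BYTES_PER_DOUBLE) -> int:
    """Bytes needed to hold num_vectors dense vectors of length num_cols."""
    return num_cols * num_vectors * bytes_per_value


cols = 47_000_000  # approximate column count from the thread

for k in (50, 200):  # range of singular vectors mentioned in the thread
    gigabytes = dense_factor_bytes(cols, k) / 1e9
    print(f"{k} singular vectors: about {gigabytes:.1f} GB")
```

With 16-18GB of RAM per node, this suggests that even the low end of the range would be tight on a single node if the vectors had to be held densely in memory at once, and the high end would not fit.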


> - the number of rows in the original matrix.  This is a secondary factor
> that can drive some intermediate products in the random projection.
>
The number of rows is around 35 million.


> Which of these will hang you up in your problem is an open question.  There
> is always the factor I haven't thought about yet.
>
> Jake, do you have any thoughts on this?
>
>
I believe that the Twitter data set would be a good stress test for the SVD
algorithm.
I should hopefully get access to the cluster by next week.

On Sat, Sep 4, 2010 at 5:08 PM, Akshay Bhat <akshayubhat@gmail.com> wrote:
>
> > Hello,
> > Has anyone attempted SVD of a really large matrix (~40 million rows
> > and columns, to be specific) using Mahout?
> > I am planning to perform SVD using Mahout on the Twitter follower network
> > (it contains information about ~35 million users following ~45 million users
> > http://an.kaist.ac.kr/traces/WWW2010.html ) and I should have access to
> > the Cornell Hadoop cluster (55 quad-core nodes with 16-18GB RAM per node).
> > Can anyone estimate how long the job will run?
> > Also, is it possible to perform regularized SVD, or will I need to add
> > that functionality by modifying the code?
> > Thank you
> >
> >
> > --
> > Akshay Uday Bhat.
> > Graduate Student, Computer Science, Cornell University
> > Website: http://www.akshaybhat.com
> >
>

Thanks

-- 
Akshay Uday Bhat.
Graduate Student, Computer Science, Cornell University
Website: http://www.akshaybhat.com
