mahout-user mailing list archives

From: Erik Frey
Subject: Re: svd, cleansvd, and expected output
Date: Wed, 14 Jul 2010 18:48:36 GMT

Hi Jake,

This worked.  cleansvd removed a handful of vectors, and the output was sorted in descending order by weight.  So far, everything looks good!



On Jul 9, 2010, at 3:58 PM, Jake Mannix wrote:

> Hey Erik,
>  The order of output is fairly arbitrary: Lanczos gets both the very
> largest and very smallest eigenvalues at the same time, and the order you
> decide to store those is pretty much up to you (do you want the biggest 20
> eigenvalues/vectors - useful for making low-rank approximations, or do you
> want the smallest eigens - useful for cutting a graph into almost
> disconnected clusters).
>  You have to be careful in choosing via looking for an elbow - if it's near
> the middle (desiredRank / 2) you could be missing a bunch of good
> eigenvalues right above the elbow.  In your case, it's looking like you've
> really got closer to 20 or so good large eigenvalues, and 20/200 is small
> enough that you most likely have all of the 20 largest, so cutting there
> seems pretty reasonable to me.  If your elbow were in the middle, you would
> have to bump up desiredRank, then throw away more.
>  For the cleansvd job, I'd run it first with really lax requirements
> (minEigenvalue 0, maxError 0.5), and see what the errors look like.  Some of
> the eigenvectors you've gotten are really errors, and will fail the maxError
> test with errors of 0.99 or higher, and will get discarded.  If you know
> (based on your first run) what eigenvalue is at the elbow, you can just set
> the minEigenvalue to be this, and you'll cut off everything below that, too.
> I *think* that cleansvd spits out the final eigen-pairs in the descending
> order you want, but try and see.
>  Let me know if that works out for you!
>  -jake
> On Sat, Jul 10, 2010 at 12:34 AM, Erik Frey wrote:
>> Hi all,
>> I recently ran mahout's svd on a large text corpus following the helpful
>> example written here:
>> Just a few questions about how I should best interpret the output:
>> * I chose to calculate 200 singular vectors - as the driver was finishing
>> up it printed out the eigenvalues and I was surprised to see them in
>> ascending order.  The first singular vector had an eigenvalue of zero, there
>> was an elbow at ~dimension 180, and a sharp incline towards an eigenvalue of
>> 1.0 at dimension 199.  I was expecting these to be in declining order.  Did
>> I do something wrong?
>> * Usually when choosing the number of dimensions I'd chop off at the elbow,
>> but cleansvd seems to have a number of more specific options.  Assuming my
>> first run has gone correctly, are there rules of thumb I should follow for
>> picking the min eigenvalue and max error?
>> Thanks,
>> Erik
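
The clean-up pass discussed in this thread (drop candidates whose error exceeds maxError, drop those below minEigenvalue, return the rest in descending order) can be illustrated with a small sketch. This is not Mahout's code: the function names are hypothetical, the error measure (1 minus the absolute cosine of the angle between v and A v) and the Rayleigh-quotient eigenvalue estimate are assumptions chosen for illustration, and the matrix is a tiny symmetric stand-in rather than a real corpus.

```python
import math


def dot(u, v):
    """Plain dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(u, v))


def matvec(a, v):
    """Multiply a dense row-major matrix (list of rows) by a vector."""
    return [dot(row, v) for row in a]


def eigen_error(a, v):
    """Assumed error measure: 1 - |cos(angle between v and A v)|.

    Close to 0 when v is (nearly) an eigenvector of A, close to 1 when
    A v is (nearly) orthogonal to v. A vector with A v = 0 is reported
    as error 1.0 here, which is a simplification.
    """
    av = matvec(a, v)
    norm_v = math.sqrt(dot(v, v))
    norm_av = math.sqrt(dot(av, av))
    if norm_v == 0.0 or norm_av == 0.0:
        return 1.0
    return 1.0 - abs(dot(v, av)) / (norm_v * norm_av)


def clean_eigenpairs(a, candidates, min_eigenvalue=0.0, max_error=0.5):
    """Keep candidates passing both thresholds, largest eigenvalue first."""
    kept = []
    for v in candidates:
        vv = dot(v, v)
        if vv == 0.0:
            continue  # a zero vector cannot be an eigenvector
        err = eigen_error(a, v)
        if err > max_error:
            continue  # spurious vector: A v does not point along v
        value = dot(v, matvec(a, v)) / vv  # Rayleigh quotient estimate
        if value < min_eigenvalue:
            continue  # below the chosen cutoff (e.g. the elbow)
        kept.append((value, err, v))
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return kept


# Toy symmetric matrix with eigenvalues -1, 1 and 2.
a = [
    [2.0, 0.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.0, 1.0, 0.0],
]

# Candidates in ascending eigenvalue order, plus one spurious vector.
candidates = [
    [0.0, 1.0, -1.0],  # true eigenvector, eigenvalue -1
    [0.0, 1.0, 0.0],   # spurious: A v is orthogonal to v
    [0.0, 1.0, 1.0],   # true eigenvector, eigenvalue 1
    [1.0, 0.0, 0.0],   # true eigenvector, eigenvalue 2
]

# Lax first pass, as suggested in the thread.
kept = clean_eigenpairs(a, candidates, min_eigenvalue=0.0, max_error=0.5)
for value, err, vector in kept:
    print(value, err, vector)
```

Running with lax thresholds first and printing the errors makes it easy to see which candidates are spurious before tightening min_eigenvalue to the value observed at the elbow.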
