mahout-user mailing list archives

From Jake Mannix <jake.man...@gmail.com>
Subject Re: svd, cleansvd, and expected output
Date Fri, 09 Jul 2010 22:58:03 GMT
Hey Erik,

  The order of output is fairly arbitrary: Lanczos gets both the very
largest and very smallest eigenvalues at the same time, and the order you
decide to store those is pretty much up to you (do you want the biggest 20
eigenvalues/vectors - useful for making low-rank approximations, or do you
want the smallest eigens - useful for cutting a graph into almost
disconnected clusters).

  You have to be careful when choosing a cutoff by looking for an elbow - if
it's near the middle (desiredRank / 2) you could be missing a bunch of good
eigenvalues right above the elbow.  In your case, it looks like you've
really got closer to 20 or so good large eigenvalues, and 20/200 is small
enough that you most likely have all of the 20 largest, so cutting there
seems pretty reasonable to me.  If your elbow were in the middle, you would
have to just bump up desiredRank, then throw away more.
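
  As a rough illustration of that check (a sketch only, not anything Mahout
provides): the helper names below are hypothetical, the spectrum is made-up
data, and the 0.25 threshold is an arbitrary assumption standing in for
"small compared to desiredRank".

```python
def count_above_elbow(eigenvalues, elbow_value):
    """Count eigenvalues at or above the value read off at the elbow."""
    return sum(1 for value in eigenvalues if value >= elbow_value)


def should_bump_rank(kept, desired_rank, safe_fraction=0.25):
    """Return True when the kept count is too large a share of desired_rank.

    safe_fraction is an illustrative threshold, not a Mahout setting.
    """
    return kept > safe_fraction * desired_rank


# Made-up spectrum in ascending order: 180 small values, then 20 large ones.
eigenvalues = [0.001 * i for i in range(180)] + [0.5 + 0.025 * i for i in range(20)]
kept = count_above_elbow(eigenvalues, elbow_value=0.5)
print(kept, should_bump_rank(kept, desired_rank=200))
```

  With this made-up spectrum, 20 of the 200 values sit above the elbow, so
the sketch does not ask for a larger desiredRank.  If roughly half of them
sat above the elbow, it would.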

  For the cleansvd job, I'd run it first with really lax requirements
(minEigenvalue 0, maxError 0.5), and see what the errors look like.  Some of
the eigenvectors you've gotten are spurious, and will fail the maxError
test with errors of 0.99 or higher, and will get discarded.  If you know
(based on your first run) what eigenvalue is at the elbow, you can just set
minEigenvalue to that value, and you'll cut off everything below it, too.
I *think* that cleansvd spits out the final eigen-pairs in descending
order, which is what you want, but try it and see.
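
  To make the two thresholds concrete, here is a minimal sketch of that kind
of filtering.  It is not the cleansvd implementation: the function name, the
(eigenvalue, error) tuple layout, the sample numbers, and the choice of
inclusive comparisons are all assumptions made for illustration.

```python
def clean_eigenpairs(pairs, max_error=0.5, min_eigenvalue=0.0):
    """Filter (eigenvalue, error) pairs and sort them largest-first.

    A pair is kept when its error is at most max_error and its eigenvalue
    is at least min_eigenvalue. Whether the real job uses strict or
    inclusive comparisons is not assumed here to match.
    """
    kept = [
        (value, error)
        for value, error in pairs
        if error <= max_error and value >= min_eigenvalue
    ]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return kept


# Made-up (eigenvalue, error) pairs in ascending eigenvalue order.
pairs = [(0.0, 0.99), (0.02, 0.3), (0.4, 0.01), (1.0, 0.001)]

lax = clean_eigenpairs(pairs, max_error=0.5, min_eigenvalue=0.0)
strict = clean_eigenpairs(pairs, max_error=0.5, min_eigenvalue=0.1)
print(lax)
print(strict)
```

  The lax pass only drops the pair with error 0.99; raising min_eigenvalue
to the (made-up) elbow value of 0.1 also drops the 0.02 pair.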

  Let me know if that works out for you!

  -jake

On Sat, Jul 10, 2010 at 12:34 AM, Erik Frey <erik@wavii.com> wrote:

> Hi all,
>
> I recently ran mahout's svd on a large text corpus following the helpful
> example written here:
> https://cwiki.apache.org/MAHOUT/dimensionalreduction.html
>
> Just a few questions about how I should best interpret the output:
>
> * I chose to calculate 200 singular vectors - as the driver was finishing
> up it printed out the eigenvalues and I was surprised to see them in
> ascending order.  The first singular vector had an eigenvalue of zero, there
> was an elbow at ~dimension 180, and a sharp incline towards an eigenvalue of
> 1.0 at dimension 199.  I was expecting these to be in declining order.  Did
> I do something wrong?
>
> * Usually when choosing the number of dimensions I'd chop off at the elbow,
> but cleansvd seems to have a number of more specific options.  Assuming my
> first run has gone correctly, are there rules of thumb I should follow for
> picking the min eigenvalue and max error?
>
> Thanks,
>
> Erik
