mahout-user mailing list archives

From Jake Mannix <jake.man...@gmail.com>
Subject Re: SVD and input args
Date Tue, 06 Jul 2010 06:24:34 GMT
It turns out that the number of rows isn't actually used in the SVD code at
all (you can put in any number for this parameter), but this is an artifact
of the particular choice of spitting out only the right singular vectors.
numCols is indeed necessary, but there's an ugly trick to figure it out
too: run it with numCols = anything, and the first time you run, you'll get
an exception which tells you what the cardinality of the vectors is.  This
is the true numCols to use.

This should probably be fixed, as this is ugly as sin.  Easy fix is: remove
numRows (add it back when it becomes necessary, if ever), and make numCols
optional, calculating it on the fly by fetching the first chunk of the
SequenceFile from HDFS and finding out the dim of the vector.
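
A rough sketch of that fix, in plain Java so it runs without Hadoop on the
classpath.  Everything named here is hypothetical (NumColsSketch,
inferNumCols, checkCardinality), and double[] is only a stand-in for the
vectors stored in the SequenceFile; a real version would pull the first
record from HDFS instead of from an in-memory list.  The second method is
the "are all the vectors the same cardinality" check Grant asks about,
which does need a full pass over the data.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical sketch: not Mahout code, and no Hadoop I/O is performed.
class NumColsSketch {

    /**
     * Infers numCols from the first row only, mirroring the idea of
     * reading just the first chunk of the input. Returns -1 if there
     * are no rows at all.
     */
    static int inferNumCols(Iterator<double[]> rows) {
        return rows.hasNext() ? rows.next().length : -1;
    }

    /**
     * Scans every row and returns the shared cardinality, or -1 for
     * empty input. Throws if any row disagrees with the first one.
     */
    static int checkCardinality(Iterable<double[]> rows) {
        int expected = -1;
        int rowIndex = 0;
        for (double[] row : rows) {
            if (expected < 0) {
                // First row seen: it defines the expected cardinality.
                expected = row.length;
            } else if (row.length != expected) {
                throw new IllegalArgumentException(
                    "row " + rowIndex + " has cardinality " + row.length
                    + ", expected " + expected);
            }
            rowIndex++;
        }
        return expected;
    }

    public static void main(String[] args) {
        // Two toy rows standing in for the input matrix.
        List<double[]> rows = Arrays.asList(
            new double[] {1.0, 0.0, 2.0},
            new double[] {0.0, 3.0, 4.0});

        System.out.println("numCols = " + inferNumCols(rows.iterator()));
        System.out.println("checked = " + checkCardinality(rows));
    }
}
```

The first-row version is cheap but trusts the input; the full scan is the
safe one, and on a big file it is the part that would have to be an M/R job.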

Glad to see some other committers playing with the SVD code finally - I
should have pretended I left those hacks in on purpose specifically to see
when y'all would use it and mention how horrible they were. :P

  -jake

On Mon, Jul 5, 2010 at 11:59 PM, Grant Ingersoll <gsingers@apache.org> wrote:

> Trying out SVD for the first time and trying to make sense of the
> parameters...
>
> Am I missing a more obvious way to get the number of rows to give to SVD
> than to iterate through the whole sequence file of vectors and count them
> up?  Assuming a sufficiently large vector file, don't I need an M/R job to do
> this?  Likewise, one would have to do this for the --numCols as well, right?
> In reality, I suppose it would be useful to have a utility that checked to
> make sure all the vectors in a file are the same cardinality, right?
>
> Just trying to get my head around the practical side of running SVD.
>
>
> Thanks,
> Grant
