mahout-user mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: SVD and input args
Date Mon, 05 Jul 2010 23:14:23 GMT
On Mon, Jul 5, 2010 at 2:59 PM, Grant Ingersoll <gsingers@apache.org> wrote:

> Trying out SVD for the first time and trying to make sense of the
> parameters...
>
> Am I missing a more obvious way to get the number of rows to give to SVD
> than to iterate through the whole sequence file of vectors and count them
> up?


Pretty much.  But you can also integrate that task into the production of
the vectors.


> Assuming a sufficiently large vector file, don't I need a M/R job to do
> this?  Likewise, one would have to do this for the --numCols as well, right?
>  In reality, I suppose it would be useful to have a utility that checked to
> make sure all the vectors in a file are the same cardinality, right?
>

Yes and no.  The number of rows should be the number of documents you
vectorized.  The number of columns should be the number of distinct terms
that you observed in vectorizing.  Both should be pretty easily available.
With sparse vectors, we don't care quite as much about the size of the
vector and often set it to a "large enough" value.
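
As a rough illustration of getting both numbers out of the vectorization
step itself, here is a minimal sketch in plain Python (not Mahout code).
The function name vectorize and the dict-based sparse rows are assumptions
made for the example; the point is only that the row count and the number
of distinct terms fall out of the same pass that builds the vectors.

```python
from collections import Counter


def vectorize(docs):
    """Build sparse term-count rows and report the matrix dimensions.

    docs is an iterable of token lists. Returns (rows, num_rows, num_cols),
    where each row is a dict mapping column index -> term count.
    """
    term_index = {}  # term -> column id, assigned the first time a term is seen
    rows = []
    for tokens in docs:
        row = Counter()
        for term in tokens:
            # A new term gets the next free column id; a known term reuses its id.
            col = term_index.setdefault(term, len(term_index))
            row[col] += 1
        rows.append(dict(row))
    num_rows = len(rows)        # one row per vectorized document
    num_cols = len(term_index)  # one column per distinct term observed
    return rows, num_rows, num_cols


docs = [["svd", "input", "args"], ["svd", "rows"], ["columns", "rows", "rows"]]
rows, num_rows, num_cols = vectorize(docs)
```

The two counts are what you would then hand to SVD as the number of rows
and the number of columns (the --numCols mentioned above), without a second
pass over the sequence file.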

The other major approach is to use random projection to get fixed length
vectors of known and predetermined size out.  This is the strategy I use in
the SGD code and it makes a lot of things much, much easier because you can
set the cardinality of the vectors involved ahead of time.  It makes
converting a vector back into terms much harder, though.
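
To make the fixed-size idea concrete, here is a minimal sketch of one
hashing-based form of random projection, written in plain Python. It is not
the Mahout SGD encoder API; the function name hashed_vector, the use of MD5,
and the cardinality and probes parameters are all assumptions chosen for
illustration.

```python
import hashlib


def hashed_vector(tokens, cardinality=16, probes=2):
    """Project a bag of terms into a dense vector of fixed, preset length.

    Each term is hashed `probes` times; every probe picks one slot and a
    sign, so the vector length never depends on the vocabulary size.
    """
    vec = [0.0] * cardinality
    for term in tokens:
        for probe in range(probes):
            digest = hashlib.md5(f"{probe}:{term}".encode("utf-8")).digest()
            # First four bytes choose the slot, the next byte chooses the sign.
            index = int.from_bytes(digest[:4], "big") % cardinality
            sign = 1.0 if digest[4] & 1 else -1.0
            vec[index] += sign
    return vec


v = hashed_vector(["svd", "input", "args"], cardinality=16)
```

The cardinality is fixed before any document is seen, which is what makes
the downstream bookkeeping easy. The cost is visible in the code as well:
different terms can land in the same slot, so there is no direct way to map
a slot back to the term that produced it.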
