Yes, you got it; that's what I was proposing before. A very easy patch.
On Sep 7, 2012 9:11 AM, "Pat Ferrel" <pat.ferrel@gmail.com> wrote:
> (U*Sigma)[i,j] = U[i,j]*sv[j] is what I meant by "write your own multiply".
>
> WRT using U * Sigma vs. U * Sigma^(1/2): I do want to retain distance
> proportions for doing clustering and similarity (though I'm not sure this is
> strictly required with cosine distance), so I probably want to use U * Sigma
> instead of U * sqrt(Sigma).
>
> Since I have no other reason to load U row by row, I could write another
> transform and keep it out of the Mahout core, but doing this in a patch
> seems trivial. Just create a new flag, something like uSigma (the CLI
> option looks like the hardest part, actually). For the API there needs to be
> a new setter, something like SSVDSolver#setComputeUSigma(true), then an
> extra flag check in the setup for the UJob, something like the following:
>
>     // set from the uSigma option or SSVDSolver#setComputeUSigma(true)
>     if (context.getConfiguration().get(PROP_U_SIGMA) != null) {
>       sValues = SSVDHelper.loadVector(sigmaPath,
>                                       context.getConfiguration());
>       // no need to take the sqrt for Sigma weighting:
>       // sValues.assign(Functions.SQRT);
>     }
>
> sValues is already applied to U in the map, which would remain unchanged,
> since the Sigma-weighted (instead of sqrt-Sigma-weighted) values will
> already be in sValues.
>
>     if (sValues != null) {
>       for (int i = 0; i < k; i++) {
>         uRow.setQuick(i,
>                       qRow.dot(uHat.viewColumn(i)) * sValues.getQuick(i));
>       }
>     } else {
>       …
>
> I'll give this a try and if it seems reasonable submit a patch.
>
> On Sep 6, 2012, at 1:01 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
> >
> > When using PCA it's also preferable to use uHalfSigma to create U with
> > the SSVD solver. One difficulty is that to perform the multiplication you
> > have to turn the singular values vector (diagonal values) into a
> > distributed row matrix or write your own multiply function, correct?
>
> You could do that, but why? Sigma is a diagonal matrix (which is
> additionally encoded as a very short vector of singular values of
> length k; say we denote it as 'sv'). Given that, there's absolutely no
> reason to encode it as a distributed row matrix.
>
> Multiplication can be done on the fly as you load U, row by row:
> (U*Sigma)[i,j] = U[i,j]*sv[j]
>
> One inconvenience with that approach is that it assumes you can freely
> hack the code that loads the U matrix for further processing.
>
> It is much easier to have SSVD output U*Sigma directly using the
> same logic as above (requires a patch) or just have it output
> U*Sigma^0.5 (does not require a patch).
>
> You could even use U in some cases directly, but part of the problem
> is that data variances will be normalized in all directions compared
> to PCA space, which will affect actual distances between data points.
> If you want to retain proportions of the directional variances as in
> your original input, you need to use principal components with scaling
> applied, i.e. U*Sigma.
>
>
>
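
For reference, the row-by-row multiply discussed in this thread can be sketched without any Mahout dependencies. This is only an illustration on plain Java arrays: the class name USigmaSketch and the helpers scaleRow and distance are hypothetical and are not part of Mahout or the SSVD code. It scales each row of a small made-up U by the singular value vector sv, and prints the Euclidean distance between two rows before and after scaling, to show that the weighting changes distances between data points.

```java
class USigmaSketch {

    // Scales one row of U by the singular values:
    // (U*Sigma)[i][j] = U[i][j] * sv[j]. Sigma is never built as a matrix.
    static double[] scaleRow(double[] uRow, double[] sv) {
        double[] out = new double[uRow.length];
        for (int j = 0; j < uRow.length; j++) {
            out[j] = uRow[j] * sv[j];
        }
        return out;
    }

    // Euclidean distance between two rows of equal length.
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int j = 0; j < a.length; j++) {
            double d = a[j] - b[j];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        // Made-up example values (assumption): k = 2 singular values
        // and two rows of U.
        double[] sv = {4.0, 1.0};
        double[][] u = {{0.6, 0.8}, {0.8, -0.6}};

        // Apply the scaling one row at a time, as each row is loaded.
        double[][] uSigma = new double[u.length][];
        for (int i = 0; i < u.length; i++) {
            uSigma[i] = scaleRow(u[i], sv);
        }

        System.out.println("distance in U:       " + distance(u[0], u[1]));
        System.out.println("distance in U*Sigma: " + distance(uSigma[0], uSigma[1]));
    }
}
```

Passing the element-wise square roots of sv instead would give the U*Sigma^0.5 weighting mentioned above.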
