mahout-user mailing list archives

From Dan Brickley <>
Subject Re: 'bin/mahout rowid' failing
Date Sat, 08 Oct 2011 23:11:53 GMT
On 8 October 2011 23:58, Ted Dunning <> wrote:
> On Sat, Oct 8, 2011 at 12:43 PM, Dan Brickley <> wrote:
>> ...
>> ...and I get, as expected, a few less than 100 due to the cleaning (88).
>> Each
>> of these has 27683 values, which is the number of topic codes in my data.
>> I'm reading this (correct me if I have this backwards) as if my topics are
>> now data points positioned in a new compressed version of a 'book space'.
>> What I was after was instead 100000 books in a new lower-dimensioned 'topic
>> space' (can I say this as: I want left singular vectors but I'm getting
>> right singular vectors?). Hence the attempt to transpose and rerun Lanczos;
>> I thought this the conceptually simplest if not most efficient way to get
>> there. I understand there are other routes but expected this one to work.
> I don't know the options to Lanczos as well as I should.  Your idea that you
> are getting right vectors only is correct and the idea to transpose in order
> to get the left vectors is sound.
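
A quick NumPy sanity check of that transpose idea (not Mahout; toy sizes standing in for the books x topics matrix): running SVD on A.T swaps the roles of the left and right singular vectors, so the "right vectors" of the transposed input are exactly the left vectors I was after.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 30))   # toy stand-in for books x topics

U, s, Vt = np.linalg.svd(A, full_matrices=False)
U2, s2, V2t = np.linalg.svd(A.T, full_matrices=False)

# Singular values are identical, and the right singular vectors of A.T
# are the left singular vectors of A (up to per-vector sign flips).
assert np.allclose(s, s2)
assert np.allclose(np.abs(V2t), np.abs(U.T))
```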

Thanks, Ted.

> I think that there is an option to get the left vectors as well as the right ones.

That would be handy.

bin/mahout svd --help gives no clue of such an option.

When I asked about this a while back (having naively expected 3
matrices back per textbook SVD) your response was
"Generally the SVD in these sorts of situations does not return the
entire set of three matrices.  Instead it returns either the left or
right (but usually the right) eigenvectors premultiplied by the
diagonal or the square root of the diagonal element." and "You can
multiply the original matrix by the transpose of the available
eigenvectors and the inverse of the eigenvalues to get the missing
eigenvectors."
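
That recipe, written out in NumPy rather than Mahout (a toy sketch, assuming we only kept the right singular vectors V and the singular values s): since A = U diag(s) V^T, the missing left vectors fall out of U = A V diag(1/s).

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((80, 20))

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Pretend only the right singular vectors V and the singular values s
# survived (as from the Lanczos job); recover the left vectors via
# U = A V diag(1/s), i.e. the recipe quoted above.
U_recovered = A @ Vt.T @ np.diag(1.0 / s)

assert np.allclose(U_recovered, U)
```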

(So I figured transposing the input and re-running was safer and
simpler than trying to work out how to take the inverse of the
eigenvalues, at least until I find my feet. But if there's a simple
set of 'bin/mahout' calls for this, it would be great to have them
documented somewhere.)

> Also, while you are at it, I think that the code in MAHOUT-792 might be able
> to do these decompositions at your scale much, much faster since they use an
> in-memory algorithm on a single machine to avoid all the kerfuffle of
> running a map-reduce.
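
(As I understand it, the in-memory approach behind MAHOUT-792 is a randomized-projection decomposition. A rough NumPy sketch of that idea, not Mahout's actual code: project onto a small random subspace, orthonormalize, then do a cheap exact SVD there.)

```python
import numpy as np

def randomized_svd(A, rank, oversample=10, rng=None):
    """Sketch of a randomized SVD: a random projection captures the
    dominant column space of A, leaving only a small exact SVD."""
    rng = rng or np.random.default_rng(0)
    m, n = A.shape
    Omega = rng.standard_normal((n, rank + oversample))
    Q, _ = np.linalg.qr(A @ Omega)      # orthonormal basis for range(A @ Omega)
    B = Q.T @ A                         # small (rank+oversample) x n problem
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (Q @ Ub)[:, :rank], s[:rank], Vt[:rank]

rng = np.random.default_rng(2)
# Exactly rank-5 test matrix, so the sketch should recover it precisely.
A = rng.standard_normal((200, 5)) @ rng.standard_normal((5, 50))

U, s, Vt = randomized_svd(A, rank=5)
s_exact = np.linalg.svd(A, compute_uv=False)[:5]
assert np.allclose(s, s_exact)
```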

These tests are with just 100k entries. The full collection is
somewhat over 12 million book records, which I'd assumed to be Hadoop
territory. I tried loading a MySQL dump of that on my laptop but gave
up after a couple of weeks :) So the thinking is to get a feel for
things with 100k, then have a go with the whole lot, at which point
map-reduce should earn its keep. FWIW the running time for Lanczos in
MAHOUT_LOCAL mode on a MacBook Pro was reasonably painless. But I'll
take a look at MAHOUT-792 for sure.

Back on the original theme: could someone sanity-check this for me by
running 'MAHOUT_LOCAL=true mahout rowid --help' on a recent trunk
build? Since I don't believe my installation should be weird, I'd love
some confirmation that this works for others.


