mahout-user mailing list archives

From Pat Ferrel <>
Subject Anonymous rows in clusters after SSVD
Date Sun, 09 Sep 2012 18:09:10 GMT
Regarding SSVD + clustering

I tried the command line version of kmeans on U*Sigma and don't get row IDs in clusteredPoints
there either. Using the command line kmeans on the input matrix A does generate row IDs. There
must be some difference between the two that causes this to happen.

I used seq2sparse to create the NamedVectors and rowid to turn them into a DRM = A. Rowid
creates a file "docIndex" that maps the row IDs of A to the Keys in the TFIDF vector files,
so it does not put NamedVectors into A and relies on Keys to identify rows. Then kmeans on A
creates row IDs in clusteredPoints.
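
A minimal sketch of that rowid bookkeeping, written as a toy model in Python rather than Mahout code. The function name rowid_like and the sample document keys are hypothetical; the sketch only assumes that each string key is replaced by a consecutive integer row ID and that the mapping is kept in a docIndex-style lookup.

```python
def rowid_like(named_rows):
    """Toy model of the rowid step: swap string keys for integer row IDs.

    named_rows: list of (key, vector) pairs, standing in for the keyed
    TF-IDF vectors produced by seq2sparse.
    Returns (matrix, doc_index), where matrix maps int row ID -> vector
    and doc_index maps int row ID -> original key.
    """
    matrix = {}
    doc_index = {}
    for row_id, (key, vector) in enumerate(named_rows):
        matrix[row_id] = vector
        doc_index[row_id] = key
    return matrix, doc_index


# Hypothetical TF-IDF rows keyed by document name (illustrative data only).
tfidf_rows = [
    ("/docs/a.txt", [0.0, 1.2, 0.4]),
    ("/docs/b.txt", [0.9, 0.0, 0.0]),
]
matrix_a, doc_index = rowid_like(tfidf_rows)
```

After this step the matrix itself carries only integer keys, so any later lookup of a document name has to go through doc_index.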

Using the output of SSVD = U*Sigma as input to the same command line version of kmeans produces
no row IDs in "clusteredPoints". As I said earlier this makes it impossible to tie clustered
vectors back to pre-SSVD input vectors.
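
The tie-back that depends on those row IDs can be sketched as follows. This is a toy Python model, not Mahout output: label_clusters, the cluster assignments, and the document keys are hypothetical. It assumes each clustered point arrives as a (cluster ID, row ID) pair and uses None for a point whose row ID is missing.

```python
def label_clusters(clustered_points, doc_index):
    """Group original document keys by cluster ID.

    clustered_points: iterable of (cluster_id, row_id) pairs; row_id is
    None when the clustering output carries no row identifier.
    doc_index: dict mapping int row ID -> original document key.
    Returns (by_cluster, unresolved), where unresolved counts the points
    that could not be mapped back to a document.
    """
    by_cluster = {}
    unresolved = 0
    for cluster_id, row_id in clustered_points:
        if row_id is None or row_id not in doc_index:
            unresolved += 1
            continue
        by_cluster.setdefault(cluster_id, []).append(doc_index[row_id])
    return by_cluster, unresolved


# Hypothetical lookup from integer row ID to document key.
doc_index = {0: "/docs/a.txt", 1: "/docs/b.txt", 2: "/docs/c.txt"}

# Row IDs present, as when clustering A directly.
with_ids, missing_a = label_clusters([(7, 0), (7, 2), (9, 1)], doc_index)

# Row IDs absent, as observed when clustering U*Sigma.
without_ids, missing_b = label_clusters([(7, None), (7, None), (9, None)], doc_index)
```

With row IDs every point resolves to a document; without them every point ends up unresolved, which is the situation described above.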

This leads me to think there is some significant difference between A and U*Sigma, which is
causing this. It looks like both A and U*Sigma are <IntWritable, VectorWritable>. So
I need to dig deeper.