mahout-user mailing list archives

From Pat Ferrel <pat.fer...@gmail.com>
Subject Kmeans on SSVD output
Date Tue, 11 Sep 2012 17:44:50 GMT
Running kmeans on doc vectors turned into a DistributedRowMatrix works fine (no surprise).

But when I run an SSVD on the above input and then create U * Sigma as a DistributedRowMatrix
(IntWritable, VectorWritable), I get clusters in clusters-xx-final, but in clusteredPoints the
vectors have no IDs. Therefore the clustered points cannot be tied back to the clusters that
contain them or to the original input documents.
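
For reference, forming U * Sigma amounts to scaling column j of U by the j-th singular value, and the row keys are not touched by that step. A minimal sketch in plain Python with made-up numbers (this is not Mahout code; `u_rows`, `sigma`, and `scale_rows` are hypothetical names used only for illustration):

```python
def scale_rows(u_rows, sigma):
    """Return {row_key: row * diag(sigma)} for rows given as lists of floats."""
    scaled = {}
    for key, row in u_rows.items():
        if len(row) != len(sigma):
            raise ValueError("row %r has %d entries, expected %d"
                             % (key, len(row), len(sigma)))
        # Column j of U is multiplied by the j-th singular value.
        scaled[key] = [value * s for value, s in zip(row, sigma)]
    return scaled


# Toy data (assumed, not taken from the dumps in this message).
u_rows = {0: [0.5, -0.5], 1: [0.25, 0.75]}
sigma = [4.0, 2.0]
us_rows = scale_rows(u_rows, sigma)
```

The output is keyed by the same integer row keys as the input, which is why I would expect the IDs to survive up to the point where kmeans reads the matrix.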

To my eye the two input matrices look the same except for the weights, but A is a sparse matrix
and U is a dense matrix; I'm not sure whether this matters. Also, running rowsimilarity on the
two matrices produces correct results with vector IDs in the output, so is there something
special about kmeans?
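
On the sparse-versus-dense point: as far as I can tell, the two forms differ only in how entries are stored and printed. The sketch below is plain Python, not Mahout's vector classes, and `format_sparse`, `format_dense`, and `to_dense` are hypothetical helpers. It prints a sparse row as index:value pairs and the same row densely as bare values; if the dumps in this message follow that convention, the numbers before the colons are element indices rather than document IDs.

```python
def format_sparse(entries):
    """Render {index: value} as index:value pairs, in index order."""
    pairs = ("%d:%.3f" % (i, entries[i]) for i in sorted(entries))
    return "[" + ", ".join(pairs) + "]"


def format_dense(values):
    """Render a list of floats as bare values, one per position."""
    return "[" + ", ".join("%.3f" % v for v in values) + "]"


def to_dense(entries, size):
    """Expand {index: value} into a list of length `size`, zeros elsewhere."""
    dense = [0.0] * size
    for i, v in entries.items():
        dense[i] = v
    return dense


# Toy row (assumed values, not taken from the real matrices).
sparse_row = {2: 0.047, 4: 0.044}
dense_row = to_dense(sparse_row, 5)
sparse_text = format_sparse(sparse_row)
dense_text = format_dense(dense_row)
```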

===================================================================

Below are seqdumper snippets for clusteredPoints created from A and U * Sigma

clusteredPoints from kmeans on raw doc vectors turned into a DRM  (DRM A) 

Input Path: b/clusters/clusteredPoints/part-m-00000
Key class: class org.apache.hadoop.io.IntWritable Value Class: class org.apache.mahout.clustering.classify.WeightedVectorWritable
Key: 810: Value: 1.0: [2:0.047, 4:0.044, 8:0.049, 9:0.041, 15:0.048, 23:0.042, 26:0.041, 38:0.047,
44:0.041, 50:0.041, 57:0.045, 58:0.046, 62:0.047, 87:0.062, 101:0.046, 106:0.048, 108:0.110,
113:0.047, 120:0.049, 135:0.045,

A bit from DRM A

Input Path: /Users/pat/Projects/big-data/b/doc-matrix/matrix
Key class: class org.apache.hadoop.io.IntWritable Value Class: class org.apache.mahout.math.VectorWritable
Key: 0: Value: {2127:1.0}
Key: 1: Value: {1:0.04140155813392109,23:0.04761729906397759,33:0.04140155813392109,35:0.03874202735546817,50:0.03318442428909763,69:0.04140155813392109,90:0.03993791262049265,100:0.04140155813392109,105:0.04140155813392109,119:0.03993791262049265,124:0.04140155813392109,133:0.036082496577015254,138:0.04140155813392109,143:

clusteredPoints from kmeans on the SSVD of raw doc vectors; the input to kmeans is U * Sigma
(DRM U)

Input Path: b/clusters/clusteredPoints/part-m-00000
Key class: class org.apache.hadoop.io.IntWritable Value Class: class org.apache.mahout.clustering.classify.WeightedVectorWritable
Key: 810: Value: 1.0: [0.047, 0.032, -0.062, -0.132, -0.006, -0.076, 0.024, 0.001, -0.040,
-0.031, -0.051, 0.058, 0.006, -0.002, 0.038, 0.040, 0.065, -0.038, 0.013, -0.004]
Key: 810: Value: 1.0: [0.208, -0.074, -0.076, -0.039, 0.036, -0.066, 0.037, -0.016, 0.008,
-0.024,

A bit from DRM U (actually U * Sigma)

Input Path: /Users/pat/Projects/big-data/b/ssvd/U/part-m-00000
Key class: class org.apache.hadoop.io.IntWritable Value Class: class org.apache.mahout.math.VectorWritable
Key: 0: Value: {0:-0.05851791792014975,1:0.0806831653032894,2:-0.04529094469362176,3:0.07412534594545293,4:-0.0014950001103841534,5:0.00858150208231669,6:0.08167911600523817,7:-0.044944387969145426,8:0.10480124786699137,9:-0.012858223284407562,10:-0.178659257217503,11:0.004960726322870974,12:-0.009355080152537257,13:-0.08287756217734399,14:-0.06421245242503033,15:0.034723492354354006,16:-0.04544718418425494,17:-0.03280318371313618,18:0.014036530324351837,19:-0.011233038447454465}