mahout-user mailing list archives

From Oisin Boydell <oisin.boyd...@ucd.ie>
Subject Clusterdump output format
Date Wed, 30 Jul 2014 10:42:32 GMT
Hi,

We have been using K-Means to cluster a fairly large dataset (just under a million 128-dimensional
vectors of floating point values, about 9.2GB in space-delimited file format). We’re using
Hadoop 2.2.0 and Mahout 0.9. The dataset is first converted from the simple space-delimited format
into RandomAccessSparseVector format for K-Means using the org.apache.mahout.clustering.conversion.InputDriver
utility.

We’re not using Canopy clustering to determine the initial clusters as we want a specific
number of clusters (100,000) so we let K-Means create the initial random 100,000 centroids:

./mahout kmeans -i /lookandlearn/vectors_all -c /data/initial_centres -o /data/clusters_output -k 100000 -x 20 -ow -xm mapreduce

It all runs fine and we then extract the computed centroids using the clusterdump utility:

./mahout clusterdump -i /data/clusters_output/clusters-1-final/ -o ./clusters.txt -of TEXT

The clusters.txt output file contains the expected 100,000 lines (one cluster per line); however,
there seem to be some idiosyncrasies in the output format…

If we add up the values of n across all clusters, where n should be the number of data points
belonging to each cluster, we get a total of 39,160,754. But we expect this to be the same
as the number of input points (9,769,004), as each input point should belong to a single cluster.
We are not sure why the sum of n values is about 4 times as large as the number of input
points.

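A check like the one described above can be sketched in a few lines of Python. This is only an illustration, not the script actually used: it assumes each line of clusters.txt carries a field of the form `n=<integer>`, and the function name `sum_cluster_sizes` is hypothetical.

```python
import re

# Assumption: every cluster line contains a field such as "n=123".
N_FIELD = re.compile(r"\bn=(\d+)")


def sum_cluster_sizes(lines):
    """Return (clusters_seen, total_n) for an iterable of clusterdump lines."""
    clusters_seen = 0
    total_n = 0
    for line in lines:
        match = N_FIELD.search(line)
        if match is None:
            continue  # skip lines that do not describe a cluster
        clusters_seen += 1
        total_n += int(match.group(1))
    return clusters_seen, total_n


# Usage (file name taken from the clusterdump command above):
#     with open("clusters.txt") as handle:
#         print(sum_cluster_sizes(handle))
```

Comparing the returned total with the number of input points gives the ratio in question (39,160,754 / 9,769,004 is roughly 4.01).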
We also notice that the cluster centroids and radii are written in two different vector
formats. The majority use a simple comma-separated array format, e.g.

c=[0.008, 0.006, 0.009, 0.014, 0.006, 0.003, 0.007, 0.005, 0.032, 0.004, 0.001, 0.003, 0.002,
0.002, 0.007, 0.017, 0.011, 0.002, 0.001, 0.014, 0.032, 0.015, 0.001, 0.002, 0.025, 0.007,
0.001, 0.007, 0.031, 0.004, 0.000, 0.005, 0.006, 0.003, 0.005, 0.029, 0.023, 0.001, 0.000,
0.005, 0.032, 0.007, 0.001, 0.009, 0.014, 0.002, 0.000, 0.004, 0.011, 0.001, 0.002, 0.010,
0.032, 0.017, 0.000, 0.002, 0.013, 0.019, 0.008, 0.009, 0.017, 0.005, 0.001, 0.003, 0.007,
0.005, 0.002, 0.014, 0.021, 0.002, 0.001, 0.005, 0.032, 0.006, 0.005, 0.014, 0.016, 0.003,
0.001, 0.004, 0.006, 0.000, 0.001, 0.005, 0.031, 0.026, 0.001, 0.002, 0.009, 0.002, 0.003,
0.004, 0.006, 0.015, 0.004, 0.006, 0.006, 0.002, 0.002, 0.006, 0.003, 0.001, 0.003, 0.009,
0.004, 0.002, 0.005, 0.018, 0.012, 0.001, 0.000, 0.002, 0.001, 0.000, 0.007, 0.016, 0.021,
0.006, 0.001, 0.000, 0.006, 0.003, 0.013, 0.012, 0.003, 0.002, 0.000, 0.001]

But there are also a significant number of clusters where the format appears to be a sparse
array representation, with each value prefixed by its position index, e.g.

c=[0:0.056, 1:0.006, 2:0.000, 3:0.000, 4:0.000, 5:0.000, 6:0.000, 7:0.004, 8:0.057, 9:0.002,
10:0.000, 11:0.000, 12:0.000, 13:0.000, 14:0.000, 15:0.005, 16:0.056, 17:0.004, 18:0.000,
19:0.000, 20:0.000, 23:0.002, 24:0.024, 25:0.009, 26:0.013, 27:0.005, 28:0.001, 29:0.001,
30:0.000, 31:0.000, 32:0.057, 33:0.006, 34:0.000, 35:0.000, 36:0.000, 37:0.000, 38:0.000,
39:0.002, 40:0.057, 41:0.007, 42:0.000, 43:0.000, 44:0.000, 45:0.000, 46:0.000, 47:0.004,
48:0.057, 49:0.008, 50:0.000, 51:0.000, 52:0.000, 55:0.001, 56:0.050, 57:0.007, 58:0.000,
59:0.000, 60:0.000, 61:0.000, 62:0.000, 63:0.001, 64:0.057, 65:0.003, 66:0.000, 67:0.000,
68:0.000, 69:0.000, 70:0.000, 71:0.006, 72:0.057, 73:0.004, 74:0.000, 75:0.000, 76:0.000,
77:0.000, 78:0.000, 79:0.009, 80:0.057, 81:0.003, 82:0.000, 83:0.000, 84:0.000, 87:0.006,
88:0.047, 89:0.004, 90:0.000, 91:0.000, 92:0.000, 93:0.000, 94:0.000, 95:0.006, 96:0.056,
97:0.005, 98:0.000, 99:0.000, 100:0.000, 101:0.000, 102:0.000, 103:0.003, 104:0.057, 105:0.003,
106:0.000, 107:0.000, 108:0.000, 109:0.000, 110:0.000, 111:0.006, 112:0.056, 113:0.000, 114:0.000,
115:0.000, 116:0.000, 117:0.000, 118:0.000, 119:0.008, 120:0.038, 121:0.001, 122:0.000, 123:0.000,
124:0.000, 125:0.000, 126:0.000, 127:0.006]

In this case, should values missing from this sparse vector format be interpreted as 0.0, e.g.
the value for dimension 21 in the above example? Why are zero values still included in this
output format (e.g. dimensions 2, 3, 4 etc. above)? It also seems awkward to us that the clusterdump
output contains different vector formats, as this makes it more complex to parse. Finally, we find
that if we set the clusterdump output format to CSV instead of TEXT ("-of CSV"), no output
file is produced.
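To illustrate the parsing complication, here is a minimal Python sketch that accepts both vector layouts shown above. It assumes that indices omitted from the index:value form stand for 0.0, which is exactly the point we would like confirmed, and that all vectors have 128 dimensions. The function name `parse_vector` is hypothetical.

```python
import re


def parse_vector(text, dimensions=128):
    """Parse a 'c=[...]' or 'r=[...]' field into a dense list of floats.

    Handles both the plain comma-separated layout and the index:value layout.
    Assumption: indices missing from the index:value layout are 0.0.
    """
    # Drop an optional single-letter prefix such as "c=" and the brackets.
    body = re.sub(r"^[a-z]=", "", text.strip()).strip("[] \n")
    values = [0.0] * dimensions
    if not body:
        return values
    for position, token in enumerate(body.split(",")):
        token = token.strip()
        if ":" in token:
            # Sparse layout: the index is given explicitly.
            index, value = token.split(":", 1)
            values[int(index)] = float(value)
        else:
            # Dense layout: the index is the position in the list.
            values[position] = float(token)
    return values
```

With this, both kinds of line yield a list of the same length, so downstream code does not need to care which layout clusterdump chose for a given cluster.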

Any information or feedback on the above would be greatly appreciated.

Regards,
Oisin.



