mahout-user mailing list archives

From Pat Ferrel <...@occamsmachete.com>
Subject Re: kmeans from 0.6 to 0.7
Date Thu, 07 Jun 2012 18:57:59 GMT
Further, it appears the clusters file now has IntWritable keys and 
ClusterWritable values? This is according to seqdumper. But the output of 
clusterdump on the same file still shows what look like strings as keys, 
with "VL-" prepended to each cluster id. I'm having trouble 
iterating through the clusters file because I'm confused about the type 
of its contents. I create the iterator thus:

SequenceFileIterator<IntWritable, ClusterWritable> iterator =
    new SequenceFileIterator<IntWritable, ClusterWritable>(
        clusterConf.getClusterFiles(), true, conf);

This produces the error:
Exception in thread "main" java.lang.IllegalStateException: 
java.io.IOException: 
org.apache.mahout.clustering.iterator.ClusterWritable@2ecc5436 read 122 
bytes, should read 8419

So I must have the types wrong?


The output of seqdumper looks like this:

Input Path: clusters-7-final/part-r-00000
Key class: class org.apache.hadoop.io.IntWritable Value Class: class 
org.apache.mahout.clustering.iterator.ClusterWritable
Key: 0: Value: 
org.apache.mahout.clustering.iterator.ClusterWritable@18b1aebf
Key: 1: Value: 
org.apache.mahout.clustering.iterator.ClusterWritable@18b1aebf
Key: 2: Value: 
org.apache.mahout.clustering.iterator.ClusterWritable@18b1aebf
Key: 3: Value: 
org.apache.mahout.clustering.iterator.ClusterWritable@18b1aebf
Key: 4: Value: 
org.apache.mahout.clustering.iterator.ClusterWritable@18b1aebf
Key: 5: Value: 
org.apache.mahout.clustering.iterator.ClusterWritable@18b1aebf
Key: 6: Value: 
org.apache.mahout.clustering.iterator.ClusterWritable@18b1aebf
Key: 7: Value: 
org.apache.mahout.clustering.iterator.ClusterWritable@18b1aebf
Key: 8: Value: 
org.apache.mahout.clustering.iterator.ClusterWritable@18b1aebf
Key: 9: Value: 
org.apache.mahout.clustering.iterator.ClusterWritable@18b1aebf
Key: 10: Value: 
org.apache.mahout.clustering.iterator.ClusterWritable@18b1aebf
Count: 11

The output of clusterdump starts out like this:
VL-500{n=74 c=[6:0.006, 24:0.003, 26:0.001, 29:0.004, 33:0.011, 
43:0.001, 65:0.001, 69:0.002, 74:0.026, 77:0.011, 98:0.002, 104:0.002, 
110:0.010, 111:0.014, 112:0.003, 133:0.006, 134:0.005, 137:0.001, 
142:0.013, 143:0.003, 144:0.002, 145:0.002, 147:0.005, 151:0.028, 
154:0.005, 179:0.007, 184:0.028, 188:0.003, 191:0.003, 208:0.010, 
217:0.013, 22

On 6/7/12 10:00 AM, Pat Ferrel wrote:
> It appears that in kmeans the clusteredPoints are now written as 
> WeightedVectorWritable where in mahout 0.6 they were 
> WeightedPropertyVectorWritable? This means that the distance from the 
> centroid is no longer stored here? Why? I hope I'm wrong because that 
> is not a welcome change. How is one to order clustered docs by 
> distance from cluster centroid?
>
> I'm sure I could calculate the distance but that would mean looking up 
> the centroid for the cluster id given in the above 
> WeightedVectorWritable, which means iterating through all the clusters 
> for each clustered doc. In my case the number of clusters could be 
> fairly large.
>
> Am I missing something?
>
>
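
One way to avoid the per-document scan described in the quoted message might be to read the clusters file once into a map keyed by cluster id, then look up each document's centroid in constant time. The sketch below only illustrates that idea: it uses plain Java arrays rather than Mahout's Vector and ClusterWritable types, the names CentroidLookup, indexCentroids and distanceToCentroid are hypothetical, and Euclidean distance is an assumption (substitute whichever distance measure the kmeans run used).

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Mahout: indexes centroids by cluster id
// so each clustered document needs one map lookup instead of a full scan.
class CentroidLookup {

    // Build the id -> centroid index in a single pass over the clusters.
    static Map<Integer, double[]> indexCentroids(int[] ids, double[][] centroids) {
        if (ids.length != centroids.length) {
            throw new IllegalArgumentException("ids and centroids differ in length");
        }
        Map<Integer, double[]> index = new HashMap<Integer, double[]>();
        for (int i = 0; i < ids.length; i++) {
            index.put(ids[i], centroids[i]);
        }
        return index;
    }

    // Euclidean distance between two dense vectors of equal length (assumed measure).
    static double euclidean(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors differ in length");
        }
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Distance from one clustered point to the centroid of its assigned cluster.
    static double distanceToCentroid(Map<Integer, double[]> index, int clusterId, double[] point) {
        double[] centroid = index.get(clusterId);
        if (centroid == null) {
            throw new IllegalArgumentException("unknown cluster id " + clusterId);
        }
        return euclidean(centroid, point);
    }

    public static void main(String[] args) {
        // Made-up ids and centroids, purely for illustration.
        Map<Integer, double[]> index = indexCentroids(
            new int[] {500, 7},
            new double[][] {{0.0, 0.0}, {1.0, 1.0}});
        System.out.println(distanceToCentroid(index, 500, new double[] {3.0, 4.0}));
    }
}
```

With the index built once, ordering the documents of a cluster by distance costs one lookup per document, regardless of how many clusters there are.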
