mahout-user mailing list archives

From Jeff Eastman <>
Subject Re: kMeans Help
Date Sat, 27 Jun 2009 15:42:03 GMT
I think this comment is on the right track. During an iteration, each 
cluster is created with a center and no points. Then, as each point is 
compared against the cluster centers, it is added to the closest 
cluster. If the initial center is also counted as a point, it 
incorrectly biases the new centroid toward the old center, as the 
example below shows.

One could argue that the centroid of a degenerate cluster with no points 
ought to be its center and not a zero vector, but clusters with points 
should have centroids that do not include the old center.
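
To make that concrete, here is a minimal sketch of the behaviour described above. It is not Mahout's actual Cluster class: SketchCluster, its fields, and computeCentroid() are hypothetical names used only for illustration, and the fallback to the center for an empty cluster is the debatable choice mentioned above, not confirmed behaviour.

```java
import java.util.Arrays;

// Hypothetical sketch, NOT Mahout's Cluster implementation.
public class SketchCluster {
    private final double[] center;   // center fixed at the start of the iteration
    private final double[] pointSum; // running sum of assigned points only
    private int numPoints;           // the center itself is never counted

    public SketchCluster(double[] center) {
        this.center = center.clone();
        this.pointSum = new double[center.length];
    }

    // Assign a point to this cluster; only assigned points feed the centroid.
    public void addPoint(double[] point) {
        for (int i = 0; i < pointSum.length; i++) {
            pointSum[i] += point[i];
        }
        numPoints++;
    }

    // Mean of the assigned points. Assumption: an empty (degenerate)
    // cluster reports its center rather than a zero vector.
    public double[] computeCentroid() {
        if (numPoints == 0) {
            return center.clone();
        }
        double[] centroid = new double[pointSum.length];
        for (int i = 0; i < centroid.length; i++) {
            centroid[i] = pointSum[i] / numPoints;
        }
        return centroid;
    }

    public static void main(String[] args) {
        SketchCluster empty = new SketchCluster(new double[] {1, 1});
        System.out.println(Arrays.toString(empty.computeCentroid())); // [1.0, 1.0]

        SketchCluster c = new SketchCluster(new double[] {1, 1});
        c.addPoint(new double[] {3, 3});
        System.out.println(Arrays.toString(c.computeCentroid()));     // [3.0, 3.0]
    }
}
```

With center (1,1) and the single assigned point (3,3), this sketch yields (3,3) rather than (2,2), because the old center does not take part in the mean.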

nfantone wrote:
> On Sat, Jun 27, 2009 at 8:10 AM, Grant Ingersoll<> wrote:
>> On Jun 26, 2009, at 10:42 PM, Grant Ingersoll wrote:
>>> The semantics of constructing a Cluster are odd to me.  Do I always have
>>> to immediately add a point to the Cluster in order for it to be "real",
>>> despite the fact that I added a Center?  Isn't adding a Center effectively
>>> giving the Cluster one point?
> Perhaps I misunderstood you, but I think that assigning a new point to
> a Cluster (by calling addPoint(Vector)) does not mean you are "adding
> a center". A center is specified at the beginning of the algorithm,
> and every iteration, after including a set of new points, recalculates
> that center by determining a new mean, which is now the centroid of
> that particular Cluster. So, clearly, the center itself is a proper
> point in the Cluster, and you don't need to add it again after it has
> been selected as the center in order for it to be "real".
>> And if you add the center, why isn't it the centroid until other points are
>> added?
> Again, the centroid is the result of recalculating a mean and may or
> may not be a real point. With just one point in a Cluster - that is
> to say, its center - there's no "recalculation" to be done.
> Conceptually, you could say the centroid lies, in fact, at the center
> - though it's not relevant to the algorithm.
> A final example. Let's say you create a Cluster C with point (1,1) as
> its center. Then, you add (3,3) to it.
> Cluster C: (1,1);(3,3) - original center: (1,1) - centroid: (2,2)
> Now, you create another Cluster C' with the same center, but decide to
> add the point (1,1) again. Then, (3,3) is added.
> Cluster C': (1,1);(1,1);(3,3) - original center: (1,1) - centroid (5/3, 5/3).
> Ok, that was an unnecessary example. Got it. But it shows that C and C'
> are not the same cluster, because repeated points contribute to the
> overall mean.
