mahout-user mailing list archives

From Grant Ingersoll <>
Subject Re: kMeans Help
Date Sun, 28 Jun 2009 20:56:23 GMT
I get all of this. My point is that when you rehydrate the Cluster, it
doesn't properly report the centroid (per my earlier email), all because
numPoints == 0 and pointTotal is a vector that is the same as the
passed-in center vector, but initialized to 0.
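
To make the symptom concrete, here is a minimal sketch in Java. It is not
Mahout's Cluster class: SketchCluster, naiveCentroid() and
centroidOrCenter() are hypothetical names, and the sketch assumes the
centroid is computed as pointTotal / numPoints, with the zeroed pointTotal
handed back when numPoints == 0. Only numPoints, pointTotal and addPoint
are names taken from this thread.

```java
import java.util.Arrays;

/** Hypothetical stand-in for a k-means cluster; NOT Mahout's Cluster class. */
final class SketchCluster {
    private final double[] center;     // center assigned when the cluster is created
    private final double[] pointTotal; // running sum of added points, starts at zero
    private int numPoints;             // number of points added so far

    SketchCluster(double[] center) {
        this.center = center.clone();
        this.pointTotal = new double[center.length]; // same size as center, all zeros
    }

    void addPoint(double[] point) {
        for (int i = 0; i < pointTotal.length; i++) {
            pointTotal[i] += point[i];
        }
        numPoints++;
    }

    /** Assumed behaviour: with no points, the zeroed total is reported. */
    double[] naiveCentroid() {
        return numPoints == 0 ? pointTotal.clone() : mean();
    }

    /** Alternative: a cluster with no points reports its center instead. */
    double[] centroidOrCenter() {
        return numPoints == 0 ? center.clone() : mean();
    }

    // Mean of the added points; the center itself is never included.
    private double[] mean() {
        double[] result = new double[pointTotal.length];
        for (int i = 0; i < result.length; i++) {
            result[i] = pointTotal[i] / numPoints;
        }
        return result;
    }

    public static void main(String[] args) {
        // A "rehydrated" cluster: it has a center but no points yet.
        SketchCluster rehydrated = new SketchCluster(new double[] {1.0, 1.0});
        System.out.println(Arrays.toString(rehydrated.naiveCentroid()));    // prints [0.0, 0.0]
        System.out.println(Arrays.toString(rehydrated.centroidOrCenter())); // prints [1.0, 1.0]
    }
}
```

The second method shows one possible fallback: a cluster with no points
reports its center, while a cluster with points still excludes the center
from the mean.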

On Jun 27, 2009, at 11:42 AM, Jeff Eastman wrote:

> I think this comment is on the right track. During an iteration,  
> each cluster is created with a center and no points. Then, as each  
> point is compared against the cluster centers, it is added to the  
> closest cluster. If the initial center is considered to be a point,  
> then it will bias the new centroid calculation towards its center,  
> incorrectly, as shown below.
> One could argue that the centroid of a degenerate cluster with no  
> points ought to be its center and not a zero vector, but clusters  
> with points should have centroids that do not include it.
> nfantone wrote:
>> On Sat, Jun 27, 2009 at 8:10 AM, Grant Ingersoll<> wrote:
>>> On Jun 26, 2009, at 10:42 PM, Grant Ingersoll wrote:
>>>> The semantics of constructing a Cluster are odd to me.  Do I always
>>>> have to immediately add a point to the Cluster in order for it to be
>>>> "real", despite the fact that I added a Center?  Isn't adding a
>>>> Center effectively giving the Cluster one point?
>> Perhaps I misunderstood you, but I think that assigning a new point
>> (by calling addPoint(Vector)) to a Cluster does not mean you are
>> "adding a center". A center is specified at the beginning of the
>> algorithm, and every iteration, after including a set of new points,
>> recalculates that center by determining a new mean, which is now the
>> centroid of that particular Cluster. So, clearly, the center itself is
>> a proper point in the Cluster and you don't need to add it after it
>> has been selected as the center in order for it to be "real".
>>> And if you add the center, why isn't it the centroid until other
>>> points are added?
>> Again, the centroid is the result of a recalculation of a mean and
>> may or may not be a real point. With just one point in a Cluster
>> - that is to say, its center - there's no "recalculation" to be done.
>> Conceptually, you could say the centroid lies, in fact, at the center
>> - though it's not relevant to the algorithm.
>> A final example. Let's say you create a Cluster C with point (1,1) as
>> its center. Then, you add (3,3) to it.
>> Cluster C: (1,1);(3,3) - original center: (1,1) - centroid: (2,2)
>> Now, you create another Cluster C' with the same center, but decide to
>> add the center point (1,1) again. Then, (3,3) is added.
>> Cluster C': (1,1);(1,1);(3,3) - original center: (1,1) - centroid:
>> (5/3, 5/3).
>> OK, that was an unnecessary example. Got it. But it shows that C and C'
>> are not the same cluster, because point repetition contributes to the
>> overall mean.
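
The arithmetic in the quoted example (C versus C') can be checked with a
short sketch. CentroidExample and centroid() are hypothetical helpers
written for this illustration, not part of Mahout; the sketch simply
takes the component-wise mean of whatever points it is given.

```java
import java.util.Arrays;

/** Hypothetical helper for checking the quoted example; not part of Mahout. */
final class CentroidExample {

    /** Returns the component-wise mean of the given points (at least one required). */
    static double[] centroid(double[][] points) {
        if (points.length == 0) {
            throw new IllegalArgumentException("need at least one point");
        }
        double[] total = new double[points[0].length];
        for (double[] p : points) {
            for (int i = 0; i < total.length; i++) {
                total[i] += p[i];
            }
        }
        for (int i = 0; i < total.length; i++) {
            total[i] /= points.length;
        }
        return total;
    }

    public static void main(String[] args) {
        // Cluster C: the center (1,1) counted once, plus (3,3).
        double[] c = centroid(new double[][] {{1, 1}, {3, 3}});
        // Cluster C': the center (1,1) counted twice, plus (3,3).
        double[] cPrime = centroid(new double[][] {{1, 1}, {1, 1}, {3, 3}});
        System.out.println(Arrays.toString(c));      // prints [2.0, 2.0]
        System.out.println(Arrays.toString(cPrime)); // both components equal 5/3
    }
}
```

Counting the center a second time pulls the mean from (2,2) towards
(1,1), which is the bias described earlier in the thread.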

Grant Ingersoll

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:
