Hi Shashi,
I'm not sure I understand your issue. The Canopy centroid calculation
divides the individual term totals by the number of points that have
been added to the cluster, not by the cardinality of the vector:

  public Vector computeCentroid() {
    Vector result = new SparseVector(pointTotal.cardinality());
    for (int i = 0; i < pointTotal.cardinality(); i++)
      result.set(i, pointTotal.get(i) / numPoints);
    return result;
  }

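To make the difference concrete, here is a rough standalone sketch. It
uses plain double[] arrays rather than the Mahout Vector classes, and
CentroidSketch and its method names are made up for illustration; they
are not part of Mahout. It compares dividing the term totals by the
number of points (as computeCentroid() does), dividing by the
cardinality (what you describe), and averaging over nonzero entries
only (what you propose):

```java
// Illustrative only: plain arrays stand in for Mahout vectors.
// Assumes at least one point and that every point has `cardinality` entries.
final class CentroidSketch {

    // Term-by-term sum of all points (the role pointTotal plays above).
    static double[] termTotals(double[][] points, int cardinality) {
        double[] totals = new double[cardinality];
        for (double[] point : points) {
            for (int i = 0; i < cardinality; i++) {
                totals[i] += point[i];
            }
        }
        return totals;
    }

    // Same rule as computeCentroid(): divide each total by the number of points.
    static double[] centroidByNumPoints(double[][] points, int cardinality) {
        double[] result = termTotals(points, cardinality);
        for (int i = 0; i < cardinality; i++) {
            result[i] /= points.length;
        }
        return result;
    }

    // The rule described in the question: divide each total by the cardinality.
    static double[] centroidByCardinality(double[][] points, int cardinality) {
        double[] result = termTotals(points, cardinality);
        for (int i = 0; i < cardinality; i++) {
            result[i] /= cardinality;
        }
        return result;
    }

    // The proposed alternative: average each term over the points where it is nonzero.
    static double[] centroidByNonZeroCount(double[][] points, int cardinality) {
        double[] result = termTotals(points, cardinality);
        for (int i = 0; i < cardinality; i++) {
            int nonZero = 0;
            for (double[] point : points) {
                if (point[i] != 0.0) {
                    nonZero++;
                }
            }
            if (nonZero > 0) {
                result[i] /= nonZero; // terms that never occur stay at 0
            }
        }
        return result;
    }
}
```

With the two points (2, 0, 0, 4) and (4, 0, 0, 0), dividing by the
number of points gives (3, 0, 0, 2), dividing by the cardinality gives
(1.5, 0, 0, 1), and averaging over nonzero entries gives (3, 0, 0, 4).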
Am I misinterpreting something?
Jeff
Shashikant Kore wrote:
> Hi,
>
> To calculate the centroid (say, in Canopy clustering) of a set of
> sparse vectors, all the nonzero weights are added for each term and
> then divided by the cardinality of the vector, which is the average of
> the weights of a term in all the vectors.
>
> I have sparse vectors with a cardinality of 50,000+, but each vector has
> only a couple of hundred terms. While calculating the centroid, for
> each term, only a few hundred documents with nonzero term weights
> contribute to the total weight, but since it is divided by the
> cardinality (50,000), the final weight is minuscule. This results in
> small documents being marked closer to the centroid, as they have fewer
> terms in them. The clusters don't look "right."
>
> I am wondering if the term weights of the centroid should be calculated
> by considering only the nonzero elements. That is, if a term occurs
> in 10 vectors, then the weight of the term in the centroid is the average
> of these 10 weight values. I couldn't locate any literature which
> specifically talks about the case of sparse vectors in centroid
> calculation. Any pointers are appreciated.
>
> Thanks,
> shashi
