mahout-user mailing list archives

From Shashikant Kore <>
Subject Re: Centroid calculations with sparse vectors
Date Thu, 28 May 2009 06:52:04 GMT

Thank you for pointing out the error. I'm not sure what I was thinking
when I wrote cardinality as the denominator.

My concern with the following code is that the total is divided by
numPoints. For any given term, only a few of the numPoints vectors have
contributed to its weight; the rest had the value set to zero. That
drags down the average, and the effect is much more pronounced in a
large set of sparse vectors.

For example, consider the following doc vectors:

v1: [0:3,  1:6,  2:0,  3:3]
v2: [0:3,  1:0,  2:0,  3:6]
v3: [0:0,  1:0,  2:3,  3:0]

The centroid will be:

Centroid: [0:2,  1:2,  2:1,  3:3]
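That calculation can be sketched as follows, using plain Java arrays in
place of Mahout's SparseVector; the class and method names here are
illustrative, not from the Mahout codebase:

```java
// Standard centroid: per-term totals divided by the number of points.
// Plain arrays stand in for Mahout's SparseVector (an assumption for brevity).
public class CentroidDemo {
    public static double[] centroid(double[][] points) {
        int cardinality = points[0].length;
        double[] result = new double[cardinality];
        for (double[] p : points)
            for (int i = 0; i < cardinality; i++)
                result[i] += p[i];
        for (int i = 0; i < cardinality; i++)
            result[i] /= points.length;  // divide by numPoints, not cardinality
        return result;
    }

    public static void main(String[] args) {
        double[][] points = {
            {3, 6, 0, 3},  // v1
            {3, 0, 0, 6},  // v2
            {0, 0, 3, 0},  // v3
        };
        System.out.println(java.util.Arrays.toString(centroid(points)));
        // prints [2.0, 2.0, 1.0, 3.0]
    }
}
```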

The problem I face with the existing centroid calculation is that out
of 100k documents, only a few thousand (or even fewer) contribute to
the weight of a given term. When that weight is divided by 100k, it
comes very close to zero. I am looking for ways to avoid that.

If we consider only the non-zero values, the centroid will be:
Centroid: [0:3,  1:6,  2:3,  3:4.5]

Is this centroid "better" if we are considering a large number of
sparse vectors?
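The alternative I'm describing could be sketched like this (again with
plain Java arrays rather than Mahout's SparseVector; `nonZeroCentroid`
is a hypothetical name, not an existing Mahout method):

```java
// Alternative centroid: average each term over only the points
// where it is non-zero. Plain arrays stand in for SparseVector.
public class NonZeroCentroidDemo {
    public static double[] nonZeroCentroid(double[][] points) {
        int cardinality = points[0].length;
        double[] sums = new double[cardinality];
        int[] counts = new int[cardinality];
        for (double[] p : points)
            for (int i = 0; i < cardinality; i++)
                if (p[i] != 0) {
                    sums[i] += p[i];
                    counts[i]++;  // count only contributing vectors
                }
        double[] result = new double[cardinality];
        for (int i = 0; i < cardinality; i++)
            result[i] = counts[i] > 0 ? sums[i] / counts[i] : 0;
        return result;
    }

    public static void main(String[] args) {
        double[][] points = {
            {3, 6, 0, 3},  // v1
            {3, 0, 0, 6},  // v2
            {0, 0, 3, 0},  // v3
        };
        System.out.println(java.util.Arrays.toString(nonZeroCentroid(points)));
        // prints [3.0, 6.0, 3.0, 4.5]
    }
}
```

Note that with this scheme a term's centroid weight no longer shrinks as
the cluster grows, which is the behavior I'm after, though it changes
the geometric meaning of the centroid.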


On Thu, May 28, 2009 at 7:59 AM, Jeff Eastman
<> wrote:
> Hi Shashi,
> I'm not sure I understand your issue. The Canopy centroid calculation
> divides the individual term totals by the number of points that have been
> added to the cluster, not by the cardinality of the vector:
> public Vector computeCentroid() {
>   Vector result = new SparseVector(pointTotal.cardinality());
>   for (int i = 0; i < pointTotal.cardinality(); i++)
>     result.set(i, pointTotal.get(i) / numPoints);
>   return result;
> }
> Am I misinterpreting something?
> Jeff
> Shashikant Kore wrote:
>> Hi,
>> To calculate the centroid (say, in Canopy clustering) of a set of
>> sparse vectors, all the non-zero weights are added for each term and
>> then divided by the cardinality of the vector, which gives the
>> average of a term's weights across all the vectors.
>> I have sparse vectors with a cardinality of 50,000+, but each vector
>> has only a couple of hundred terms. While calculating the centroid,
>> for each term, only a few hundred documents with non-zero term
>> weights contribute to the total weight, but since that total is
>> divided by the cardinality (50,000), the final weight is minuscule.
>> This results in small documents being marked as closer to the
>> centroid because they have fewer terms. The clusters don't look
>> "right."
>> I am wondering whether the term weights of the centroid should be
>> calculated by considering only the non-zero elements. That is, if a
>> term occurs in 10 vectors, then its weight in the centroid is the
>> average of those 10 weight values. I couldn't locate any literature
>> that specifically discusses sparse vectors in centroid calculation.
>> Any pointers are appreciated.
>> Thanks,
>> --shashi
