Jeff,
Thank you for pointing out the error. I'm not sure what I was thinking
when I wrote cardinality as the denominator.
My concern with the following code is that the total is divided by
numPoints. For any given term, only a few of the numPoints vectors
have contributed to the weight; the rest had the value set to zero.
That drags down the average, and the effect is much more pronounced in
a large set of sparse vectors.
For example, consider the following doc vectors:
v1: [0:3, 1:6, 2:0, 3:3]
v2: [0:3, 1:0, 2:0, 3:6]
v3: [0:0, 1:0, 2:3, 3:0]
The centroid will be:
Centroid: [0:2, 1:2, 2:1, 3:3]
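(Each term's total is divided by all 3 points: e.g., term 0 is
(3 + 3 + 0)/3 = 2 and term 1 is (6 + 0 + 0)/3 = 2.)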
The problem I face with the existing centroid calculation is that out
of 100k documents, only a few thousand (or even fewer) contribute to
the weight of a given term. When that weight is divided by 100k, it
comes out very close to zero. I am looking for ways to avoid that.
If we consider only the nonzero values, the centroid will be:
Centroid: [0:3, 1:6, 2:3, 3:4.5]
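(Here each term is averaged only over its nonzero values: term 0 is
(3 + 3)/2 = 3 and term 3 is (3 + 6)/2 = 4.5.)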
Is this centroid "better" if we are considering a large number of
sparse vectors?
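For concreteness, here is a rough, untested sketch of what I have in
mind, modeled on the computeCentroid() you quoted below. The
nonZeroCounts vector is hypothetical; the cluster would have to
increment the count for term i whenever a point with a nonzero weight
for that term is added:

public Vector computeCentroid() {
  Vector result = new SparseVector(pointTotal.cardinality());
  for (int i = 0; i < pointTotal.cardinality(); i++) {
    // Average over only the points that actually contributed a
    // nonzero weight for this term; unseen terms stay at zero.
    double count = nonZeroCounts.get(i);
    if (count > 0)
      result.set(i, pointTotal.get(i) / count);
  }
  return result;
}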
shashi
On Thu, May 28, 2009 at 7:59 AM, Jeff Eastman
<jdog@windwardsolutions.com> wrote:
> Hi Shashi,
>
> I'm not sure I understand your issue. The Canopy centroid calculation
> divides the individual term totals by the number of points that have been
> added to the cluster, not by the cardinality of the vector:
>
> public Vector computeCentroid() {
>   Vector result = new SparseVector(pointTotal.cardinality());
>   for (int i = 0; i < pointTotal.cardinality(); i++)
>     result.set(i, pointTotal.get(i) / numPoints);
>   return result;
> }
>
> Am I misinterpreting something?
> Jeff
>
> Shashikant Kore wrote:
>>
>> Hi,
>>
>> To calculate the centroid (say, in Canopy clustering) of a set of
>> sparse vectors, all the nonzero weights are added for each term and
>> then divided by the cardinality of the vector, which gives the
>> average weight of a term across all the vectors.
>>
>> I have sparse vectors of cardinality 50,000+, but each vector has
>> only a couple hundred terms. While calculating the centroid, for
>> each term only a few hundred documents with nonzero term weights
>> contribute to the total weight, but since it is divided by the
>> cardinality (50,000), the final weight is minuscule. This results in
>> small documents being marked as closer to the centroid, since they
>> have fewer terms in them. The clusters don't look "right."
>>
>> I am wondering if the term weights of the centroid should be
>> calculated by considering only the nonzero elements. That is, if a
>> term occurs in 10 vectors, then the weight of the term in the
>> centroid is the average of those 10 weight values. I couldn't locate
>> any literature that specifically addresses the case of sparse
>> vectors in centroid calculation. Any pointers are appreciated.
>>
>> Thanks,
>> shashi
>>
>>
>
