(Sorry if I am misunderstanding the question or calculation.)
I think your point is, in a sparse vector, shouldn't we ignore the
'fake' zeroes we observe where vectors are missing an element?
Yes and no. Here, for any point you consider, you need that point's
position along all dimensions. So I don't think you can say, well,
vector 1 and 2 are at 3 along dimension 1, and vector 3 is at 0, but
I'll pretend it isn't there for my purposes. Why ignore 0? it seems
like a real value!
Or, if you say it's a 'fake' 0, then you cannot compute the centroid's
position in that dimension since you do not have enough information 
you don't know where vector 3 is in that dimension.
If they all had a 0 in the first dimension, you might say, well, none
of the vectors have any information about that dimension, so I'll
ignore it. If that's what you mean, I agree  again, you have no real
information on which to compute a centroid position in that dimension.
So how do you tell what is fake, what is a real 0?
I actually think this is a problem with the representation... when you
set a dimension to 0, internally that dimension's value is removed/not
stored. So there is no way to distinguish between 0 and "no value" 
but conceptually those two are quite different things. Am I right
about this? I think the API would have to change then. I tripped on a
very similar problem early on in implementing cosinemeasure / Pearson
correlation, for exactly the same reason. (Or perhaps I am just
projecting my past problem/solution here.)
On Thu, May 28, 2009 at 7:52 AM, Shashikant Kore <shashikant@gmail.com> wrote:
> Jeff,
>
> Thank you for pointing the error. Not sure what I was thinking when I
> wrote cardinality as the denominator.
>
> My concern in the following code is that the total is divided by
> numPoints. For a term, only few of the numPoints vectors have
> contributed towards the weight. Rest had the value set to zero. That
> drags down the average and it much more pronounced in a large set of
> sparse vectors.
>
> For example, consider following doc vectors.
>
> v1 : [0:3, 1:6, 2:0, 3:3]
> v2:[0:3, 1:0, 2:0, 3:6]
> v3: [0:0, 1:0, 2:3, 3:0]
>
> The centroid will be :
>
> Centroid: [0:2, 1:2, 2:1, 3:3]
>
> The problem I face with existing centroid calculation is out of 100k
> documents, only a few thousand (or even lower) documents contribute
> the weight of a term. When that weight is divided by 100k, the weight
> comes very close to zero. I am looking for ways to avoid that.
>
> If we consider only nonzero values, centroid will be
> Centroid: [0:3, 1:6, 2:3, 3:4.5]
>
> Is this centroid "better" if we are considering a large number of
> sparse vectors?
>
> shashi
>
> On Thu, May 28, 2009 at 7:59 AM, Jeff Eastman
> <jdog@windwardsolutions.com> wrote:
>> Hi Shashi,
>>
>> I'm not sure I understand your issue. The Canopy centroid calculation
>> divides the individual term totals by the number of points that have been
>> added to the cluster, not by the cardinality of the vector:
>>
>> public Vector computeCentroid() {
>> Vector result = new SparseVector(pointTotal.cardinality());
>> for (int i = 0; i < pointTotal.cardinality(); i++)
>> result.set(i, pointTotal.get(i) / numPoints);
>> return result;
>> }
>>
>> Am I misinterpreting something?
>> Jeff
>>
>> Shashikant Kore wrote:
>>>
>>> Hi,
>>>
>>> To calculate the centroid (say in Canopy clustering) of a set of
>>> sparse vectors, all the nonzero weights are added for each term and
>>> then divided by the cardinality of the vector. Which is the average of
>>> weights of a term in all the vectors.
>>>
>>> I have sparse vectors of cardinalty of 50,000+, but each vector has
>>> only couple of hundreds of terms. While calculating centroid, for
>>> each term, only few hundred documents with nonzero term weights
>>> contribute to the total weight, but since it is divided by the
>>> cardinalty(50,000), the final weight is miniscule. This results into
>>> small document being marked closer to the centroid as they have fewer
>>> terms in them. The clusters don't look "right."
>>>
>>> I am wondering if the term weights of centroid should be calculated by
>>> considering only the nonzero elements. That is, if a term has occurs
>>> in 10 vectors, then the weight of the term in centroid is the average
>>> of these 10 weight values. I couldn't locate any literature which
>>> specifically talks about the case of sparse vectors in centroid
>>> calculation. Any pointers are appreciated.
>>>
>>> Thanks,
>>> shashi
>>>
>>>
>>
>
