Shashi,
You are correct that this can be a problem, especially with vectors that
have a large number of elements that are zero, but not known to be such.
The definition as it stands is roughly an L^0 normalization. It is more
common in clustering to use an L^1 or L^2 normalization. This would divide
the terms by, respectively, the sum of the elements or the square root of
the sum of the squares of the elements. Both L^1 and L^2 normalization
avoid the problem you mention since negligibly small elements will not
contribute significantly to the norm.
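
To make the two normalizations concrete, here is a minimal sketch in plain Python. The function names and the sample vector are made up for illustration and are not taken from the code under discussion. I use absolute values in the L^1 case, which matches "the sum of the elements" whenever the weights are non-negative.

```python
import math

def l1_normalize(v):
    """Divide each element by the sum of absolute values (the L^1 norm)."""
    norm = sum(abs(x) for x in v)
    # An all-zero vector has no direction; return it unchanged.
    return [x / norm for x in v] if norm else list(v)

def l2_normalize(v):
    """Divide each element by the square root of the sum of squares (the L^2 norm)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)

# Hypothetical sparse weight vector: two real terms, two zeros,
# and one negligibly small element.
weights = [3.0, 4.0, 0.0, 0.0, 1e-9]
l1 = l1_normalize(weights)
l2 = l2_normalize(weights)

# The same vector without the zeros and the tiny element.
dense_l1 = l1_normalize([3.0, 4.0])
dense_l2 = l2_normalize([3.0, 4.0])
```

Here the zeros and the 1e-9 entry barely move the result: l1[0] and l2[0] are essentially the same as dense_l1[0] and dense_l2[0], whereas dividing by the number of elements would change with every added zero.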
Traditionally, L^2 norms are used with documents. This dates back to Salton
and the term-vector model of text retrieval. That practice was, however,
based on somewhat inappropriate geometric intuitions. Other norms are quite
plausibly more appropriate. For instance, if normalized term frequencies
are considered to be estimates of word generation probabilities, then the
L^1 norm is much more appropriate.
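
As a rough illustration of the probability view, the sketch below turns raw term counts into L^1-normalized frequencies. The helper name and the toy sentence are hypothetical. The point is only that the normalized values sum to one, which is what you want from an estimate of word generation probabilities.

```python
from collections import Counter

def term_probabilities(tokens):
    """L^1-normalize raw term counts so that the values sum to 1."""
    counts = Counter(tokens)
    total = sum(counts.values())
    # An empty token list yields an empty dict, so no division by zero occurs.
    return {term: n / total for term, n in counts.items()}

# Toy document, invented for the example.
probs = term_probabilities("the cat sat on the mat".split())
total_mass = sum(probs.values())
```

With L^2 normalization the squares of the values would sum to one instead, so the values themselves could not be read directly as probabilities.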
On Wed, May 27, 2009 at 11:52 PM, Shashikant Kore <shashikant@gmail.com> wrote:
> ...
> My concern in the following code is that the total is divided by
> numPoints. For a term, only few of the numPoints vectors have
> contributed towards the weight. Rest had the value set to zero. That
> drags down the average and it much more pronounced in a large set of
> sparse vectors.
>
>
