mahout-user mailing list archives

From Shashikant Kore <shashik...@gmail.com>
Subject Re: Centroid calculations with sparse vectors
Date Thu, 28 May 2009 07:30:48 GMT
Ted,

L^1/L^2 normalization sounds like a good solution. I will try it out
and report the results.

Is there any literature available comparing these normalization techniques?

Thank you.

--shashi

On Thu, May 28, 2009 at 12:30 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
> Shashi,
>
> You are correct that this can be a problem, especially with vectors that
> have a large number of elements that are zero, but not known to be such.
>
> The definition as it stands is roughly an L^0 normalization.  It is more
> common in clustering to use an L^1 or L^2 normalization.  This would divide
> the terms by, respectively, the sum of the elements or the square root of
> the sum of the squares of the elements.  Both L^1 and L^2 normalization
> avoid the problem you mention, since negligibly small elements will not
> contribute significantly to the norm.
>
> Traditionally, L^2 norms are used with documents.  This dates back to Salton
> and the term-vector model of text retrieval.  That practice was, however,
> based on somewhat inappropriate geometric intuitions.  Other norms are quite
> plausibly more appropriate.  For instance, if normalized term frequencies
> are considered to be estimates of word generation probabilities, then the
> L^1 norm is much more appropriate.
>
> On Wed, May 27, 2009 at 11:52 PM, Shashikant Kore <shashikant@gmail.com> wrote:
>
>> ...
>> My concern in the following code is that the total is divided by
>> numPoints.  For a term, only a few of the numPoints vectors have
>> contributed towards the weight; the rest had the value set to zero. That
>> drags down the average, and it is much more pronounced in a large set of
>> sparse vectors.
>>
>>
>
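
As a rough illustration of the normalizations Ted describes above, here is a
minimal, self-contained Java sketch. It uses a plain Map<Integer, Double> as a
stand-in for a sparse term-weight vector rather than the Mahout Vector API, and
the class and method names are illustrative only.

import java.util.HashMap;
import java.util.Map;

public class NormalizationSketch {

    // L^1 normalization: divide each component by the sum of the absolute
    // values of the components.
    static Map<Integer, Double> l1Normalize(Map<Integer, Double> v) {
        double sum = 0.0;
        for (double x : v.values()) {
            sum += Math.abs(x);
        }
        Map<Integer, Double> result = new HashMap<Integer, Double>();
        if (sum == 0.0) {
            return result;
        }
        for (Map.Entry<Integer, Double> e : v.entrySet()) {
            result.put(e.getKey(), e.getValue() / sum);
        }
        return result;
    }

    // L^2 normalization: divide each component by the square root of the
    // sum of the squares of the components.
    static Map<Integer, Double> l2Normalize(Map<Integer, Double> v) {
        double sumOfSquares = 0.0;
        for (double x : v.values()) {
            sumOfSquares += x * x;
        }
        double norm = Math.sqrt(sumOfSquares);
        Map<Integer, Double> result = new HashMap<Integer, Double>();
        if (norm == 0.0) {
            return result;
        }
        for (Map.Entry<Integer, Double> e : v.entrySet()) {
            result.put(e.getKey(), e.getValue() / norm);
        }
        return result;
    }

    public static void main(String[] args) {
        // A sparse term-weight vector: only non-zero terms are stored, so the
        // implicit zeros never enter the sums above and cannot dilute the norm.
        Map<Integer, Double> doc = new HashMap<Integer, Double>();
        doc.put(3, 2.0);
        doc.put(17, 1.0);
        doc.put(42, 1.0);

        System.out.println("L1: " + l1Normalize(doc)); // each weight divided by 4.0
        System.out.println("L2: " + l2Normalize(doc)); // each weight divided by sqrt(6.0)
    }
}

Note that the L^1-normalized weights sum to 1, so they can be read as estimates
of term-generation probabilities, which is the case where Ted suggests the L^1
norm is the more appropriate choice.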
