mahout-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jeff Eastman <>
Subject Re: Centroid calculations with sparse vectors
Date Thu, 28 May 2009 12:51:01 GMT
I found this ( with a quick Google 
search of "L2 norm" which turned up a number of other citations too. 
Since we already use L1 (Manhattan) and L2 (Euclidean) norms in the 
respective DistanceMeasures it might also be appropriate to allow the 
cluster centroid normalization to be pluggable. I had never considered 
that. Another mental lightbulb has just turned on. Thanks Ted.

Thinking more about DistanceMeasures, is it in general true that 
'distance(X, Y)' is just 'norm(X - Y)', for some choice of norm(x)? If 
so, would it be reasonable to factor the norm(x) out of the 
DistanceMeasures and make that be the pluggable clustering parameter? 
Hum, no, CosineDistanceMeasure normalizes by L2(X)*L2(Y), but 
some refactoring might still be possible along these lines. Perhaps a 
smaller number of DistanceMeasures would ensue.

Would users typically want to cluster using the same norm(x) for 
distance and centroid, or should they be independently pluggable?


Shashikant Kore wrote:
> Tedd,
> L^1/L^2 Normalization sounds like a  good solution. I will try it out
> and report the results.
> Is there any literature available comparison of these normalization techniques?
> Thank you.
> --shashi
> On Thu, May 28, 2009 at 12:30 PM, Ted Dunning <> wrote:
>> Shashi,
>> You are correct that this can be a problem, especially with vectors that
>> have a large number of elements that are zero, but not known to be such.
>> The definition as it stands is roughly an L^0 normalization.  It is more
>> common in clustering to use an L^1 or L^2 normalization.  This would divide
>> the terms by, respectively, the sum of the elements or the square root of
>> the sum of the squares of the elements.  Both L^1 and L^2 normalization
>> avoids the problem you mention since negligibly small elements will not
>> contribute significantly to the norm.
>> Traditionally, L^2 norms are used with documents.  This dates back to Salton
>> and the term-vector model of text retrieval.  That practice was, however,
>> based on somewhat inappropriate geometric intuitions.  Other norms are quite
>> plausibly more appropriate.  For instance, if normalized term frequencies
>> are considered to be estimates of word generation probabilities, then the
>> L^1 norm is much more appropriate.
>> On Wed, May 27, 2009 at 11:52 PM, Shashikant Kore <>wrote:
>>> ...
>>> My concern in the following code is that the total is divided by
>>> numPoints.  For a term,  only few of the numPoints vectors have
>>> contributed towards the weight. Rest had the value set to zero. That
>>> drags down the average and it much more pronounced in a large set of
>>> sparse vectors.

View raw message