mahout-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Shashikant Kore <shashik...@gmail.com>
Subject Re: Centroid calculations with sparse vectors
Date Fri, 29 May 2009 05:56:43 GMT
Hi Ted,

I tried L1 and L2 norms. The centroid definitely looks better, but the
values are still close to zero.

Please let me know if my understanding of L1, L2 norms is correct as
shown with the code below.

// L1 Norm
 public Vector computeCentroid() {
  Vector result = new SparseVector(pointTotal.cardinality());
  double sum = pointTotal.zSum();
  for (int i = 0; i < pointTotal.cardinality(); i++)
    result.set(i, pointTotal.get(i) / sum);  // Dividing the by the
sum of all weights
  return result;
 }


// L2 Norm
 public Vector computeCentroid() {
  Vector result = new SparseVector(pointTotal.cardinality());
  double sumSqrt = Math.sqrt(pointTotal.squaredSum());   // Assume the
method squaredSum returns the sum of squares of the term weights
  for (int i = 0; i < pointTotal.cardinality(); i++)
    result.set(i, pointTotal.get(i) / sumSqurt);
  return result;
 }

In the document vector, the weights are TF-IDF value as given by
DefaultSimilarity's tf() and idf().

Thanks,

--shashi

On Thu, May 28, 2009 at 9:14 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
> Actually, I realize I was being a bit misleading.  There are several
> concepts swirling around here that fall into groups
>
> distance, metric, norm:
>
> A *metric* is some real-valued, non-negative function of pairs of things.  A
> metric should only give 0 for identical points.
>
> A *distance* is a metric that satisfies the triangle inequality.
>
> A *norm* is a thing that assigns lengths to vectors.  If you have
> subtraction, then norm(x-y) is a distance.  Often, the term norm is used
> loosely for functions where f(x-y) only gives a metric rather than a norm.
>
> centroid, medoid, mean, median, average:
>
> A *centroid* is a value z that minimizes sum_i norm(x_i - z) where norm is
> almost always L^2
>
> A *medoid* is a single point x_m from a set of points that minimizes sum_i
> norm(x_i - x_m) for some norm (usually L^2)
>
> A *mean* is a one-dimensional centroid using L^2
>
> An *median* is a one-dimensional centroid using L^1 instead of L^2.  This
> value is not normally unique so there are various methods for tie-breaking.
>
> If you have a continuous space (like we have been talking about), you can
> generalize centroids, but for L^1, the corresponding centroid can be costly
> to compute except in one dimension.
>
> The place that I ran things off the rails was in conflating centroid with
> the norm that defines the centroid.  The L^2 norm is indeed computed as the
> root sum of squares for our purposes.  But the centroid based on this norm
> is computed by averaging all of the points and the original algorithm was
> correct except that numPoints must be the number of points involved in the
> centroid computation rather than the number of non-zero elements in a single
> point.
>
> It would indeed be a good idea to think about alternative clustering.  The
> move from L^2 to L^1 generally is a good thing if you want to decrease the
> sometimes overwhelming impact of outliers, but it can make things much
> harder to compute.  The move from L^2 to L^1 converts k-means to k-medoids
> and I expect that wikipedia's article on same will be pretty good.
>
> There is also a probabilistic interpretation for all of this.  For L^1 and
> L^2, you can interpret minimizing the distance in the centroid computation
> as finding a probability distribution that has maximum likelihood.  For L^2,
> the form of the distribution is the normal distribution: exp(-(x -
> mu)^2/(2s^2)).  For L^1, the distribution is the so-called Laplace
> distribution: exp( -abs(x-mu)/s ).  In general, the Laplace distribution is
> fat-tailed which is what gives it its outlier forgiving nature.
>
> Hope this clarification helps.  Sorry for adding to confusion.
>
> On Thu, May 28, 2009 at 5:51 AM, Jeff Eastman <jdog@windwardsolutions.com>wrote:
>
>> I found this (http://en.wikipedia.org/wiki/Lp_space) with a quick Google
>> search of "L2 norm" which turned up a number of other citations too. Since
>> we already use L1 (Manhattan) and L2 (Euclidean) norms in the respective
>> DistanceMeasures it might also be appropriate to allow the cluster centroid
>> normalization to be pluggable. I had never considered that. Another mental
>> lightbulb has just turned on. Thanks Ted.
>>
>
>
>
> --
> Ted Dunning, CTO
> DeepDyve
>
> 111 West Evelyn Ave. Ste. 202
> Sunnyvale, CA 94086
> http://www.deepdyve.com
> 858-414-0013 (m)
> 408-773-0220 (fax)
>



-- 
http://www.bandhan.com/

Mime
View raw message