Hi Ted,
I tried L1 and L2 norms. The centroid definitely looks better, but the
values are still close to zero.
Please let me know whether my understanding of the L1 and L2 norms is
correct, as shown in the code below.
// L1 Norm
public Vector computeCentroid() {
  Vector result = new SparseVector(pointTotal.cardinality());
  // Sum of all weights; this equals the L1 norm since TFIDF weights
  // are nonnegative.
  double sum = pointTotal.zSum();
  for (int i = 0; i < pointTotal.cardinality(); i++) {
    result.set(i, pointTotal.get(i) / sum); // divide by the sum of all weights
  }
  return result;
}
// L2 Norm
public Vector computeCentroid() {
  Vector result = new SparseVector(pointTotal.cardinality());
  // Assume the method squaredSum returns the sum of squares of the
  // term weights.
  double sumSqrt = Math.sqrt(pointTotal.squaredSum());
  for (int i = 0; i < pointTotal.cardinality(); i++) {
    result.set(i, pointTotal.get(i) / sumSqrt);
  }
  return result;
}
In the document vector, the weights are TFIDF values as given by
DefaultSimilarity's tf() and idf().
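As a sanity check, here is the same arithmetic on a plain double[] (no
Mahout classes; NormCheck, l1Normalize, and l2Normalize are just
illustrative names):

```java
// Toy check of the two normalizations on a plain array.
public class NormCheck {

    // L1: divide each component by the sum of absolute values.
    public static double[] l1Normalize(double[] v) {
        double sum = 0.0;
        for (double x : v) {
            sum += Math.abs(x);
        }
        double[] result = new double[v.length];
        for (int i = 0; i < v.length; i++) {
            result[i] = v[i] / sum;
        }
        return result;
    }

    // L2: divide each component by the root sum of squares.
    public static double[] l2Normalize(double[] v) {
        double sumSq = 0.0;
        for (double x : v) {
            sumSq += x * x;
        }
        double norm = Math.sqrt(sumSq);
        double[] result = new double[v.length];
        for (int i = 0; i < v.length; i++) {
            result[i] = v[i] / norm;
        }
        return result;
    }

    public static void main(String[] args) {
        double[] v = {3.0, 4.0};
        // L1: components sum to 1; L2: components have unit Euclidean length.
        System.out.println(java.util.Arrays.toString(l1Normalize(v)));
        System.out.println(java.util.Arrays.toString(l2Normalize(v))); // [0.6, 0.8]
    }
}
```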
Thanks,
shashi
On Thu, May 28, 2009 at 9:14 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
> Actually, I realize I was being a bit misleading. There are several
> concepts swirling around here that fall into groups
>
> distance, metric, norm:
>
> A *metric* is some real-valued, nonnegative function of pairs of things. A
> metric should only give 0 for identical points.
>
> A *distance* is a metric that satisfies the triangle inequality.
>
> A *norm* is a thing that assigns lengths to vectors. If you have
> subtraction, then norm(x - y) is a distance. Often, the term norm is used
> loosely for functions where f(x - y) only gives a metric rather than a norm.
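> A minimal sketch of the norm-to-distance construction in plain Java
> (illustrative names, not Mahout's API):

```java
// A norm assigns a length to a vector; given subtraction,
// distance(x, y) = norm(x - y).
public class NormDistance {

    // L^2 (Euclidean) norm: root sum of squares.
    public static double l2Norm(double[] v) {
        double sumSq = 0.0;
        for (double x : v) {
            sumSq += x * x;
        }
        return Math.sqrt(sumSq);
    }

    // Distance induced by the norm.
    public static double distance(double[] x, double[] y) {
        double[] diff = new double[x.length];
        for (int i = 0; i < x.length; i++) {
            diff[i] = x[i] - y[i];
        }
        return l2Norm(diff);
    }

    public static void main(String[] args) {
        // Classic 3-4-5 triangle.
        System.out.println(distance(new double[]{0, 0}, new double[]{3, 4})); // 5.0
    }
}
```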
>
> centroid, medoid, mean, median, average:
>
> A *centroid* is a value z that minimizes sum_i norm(x_i - z), where norm is
> almost always L^2.
>
> A *medoid* is a single point x_m from a set of points that minimizes sum_i
> norm(x_i - x_m) for some norm (usually L^2).
>
> A *mean* is a one-dimensional centroid using L^2.
>
> A *median* is a one-dimensional centroid using L^1 instead of L^2. This
> value is not normally unique, so there are various methods for tie-breaking.
>
> If you have a continuous space (like we have been talking about), you can
> generalize centroids, but for L^1, the corresponding centroid can be costly
> to compute except in one dimension.
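> A one-dimensional sketch of the two centroids in plain Java; note how a
> single outlier drags the mean but barely moves the median:

```java
import java.util.Arrays;

// One-dimensional centroids: the mean minimizes sum_i (x_i - z)^2 (L^2);
// the median minimizes sum_i |x_i - z| (L^1).
public class Centroid1D {

    public static double mean(double[] x) {
        double sum = 0.0;
        for (double v : x) {
            sum += v;
        }
        return sum / x.length;
    }

    // Lower median; for an even count any value between the two middle
    // points minimizes the L^1 objective, so this is one tie-breaking rule.
    public static double median(double[] x) {
        double[] sorted = x.clone();
        Arrays.sort(sorted);
        return sorted[(sorted.length - 1) / 2];
    }

    public static void main(String[] args) {
        double[] clean = {1, 2, 3, 4, 5};
        double[] withOutlier = {1, 2, 3, 4, 100};
        System.out.println(mean(clean) + " " + median(clean));             // 3.0 3.0
        System.out.println(mean(withOutlier) + " " + median(withOutlier)); // 22.0 3.0
    }
}
```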
>
> The place that I ran things off the rails was in conflating centroid with
> the norm that defines the centroid. The L^2 norm is indeed computed as the
> root sum of squares for our purposes. But the centroid based on this norm
> is computed by averaging all of the points and the original algorithm was
> correct except that numPoints must be the number of points involved in the
> centroid computation rather than the number of nonzero elements in a single
> point.
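> In other words, the update is just a component-wise average (a sketch,
> not the actual Mahout code):

```java
// L^2 centroid of a cluster: sum the points component-wise, then divide
// by numPoints = the number of points in the cluster (not the number of
// nonzero elements in any single point).
public class ClusterCentroid {

    public static double[] centroid(double[][] points) {
        int dim = points[0].length;
        int numPoints = points.length;
        double[] total = new double[dim];
        for (double[] p : points) {
            for (int i = 0; i < dim; i++) {
                total[i] += p[i];
            }
        }
        for (int i = 0; i < dim; i++) {
            total[i] /= numPoints;
        }
        return total;
    }
}
```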
>
> It would indeed be a good idea to think about alternative clustering. The
> move from L^2 to L^1 generally is a good thing if you want to decrease the
> sometimes overwhelming impact of outliers, but it can make things much
> harder to compute. The move from L^2 to L^1 converts k-means to k-medoids,
> and I expect that Wikipedia's article on same will be pretty good.
>
> There is also a probabilistic interpretation for all of this. For L^1 and
> L^2, you can interpret minimizing the distance in the centroid computation
> as finding a probability distribution that has maximum likelihood. For L^2,
> the form of the distribution is the normal distribution: exp(-(x -
> mu)^2/(2 s^2)). For L^1, the distribution is the so-called Laplace
> distribution: exp(-abs(x - mu)/s). In general, the Laplace distribution is
> fat-tailed, which is what gives it its outlier-forgiving nature.
>
> Hope this clarification helps. Sorry for adding to confusion.
>
> On Thu, May 28, 2009 at 5:51 AM, Jeff Eastman <jdog@windwardsolutions.com>wrote:
>
>> I found this (http://en.wikipedia.org/wiki/Lp_space) with a quick Google
>> search of "L2 norm" which turned up a number of other citations too. Since
>> we already use L1 (Manhattan) and L2 (Euclidean) norms in the respective
>> DistanceMeasures it might also be appropriate to allow the cluster centroid
>> normalization to be pluggable. I had never considered that. Another mental
>> lightbulb has just turned on. Thanks Ted.
>>
>
>
>
> 
> Ted Dunning, CTO
> DeepDyve
>
> 111 West Evelyn Ave. Ste. 202
> Sunnyvale, CA 94086
> http://www.deepdyve.com
> 858-414-0013 (m)
> 408-773-0220 (fax)
>

http://www.bandhan.com/
