# spark-issues mailing list archives

##### Site index · List index
Message view
Top
From "zhengruifeng (Jira)" <j...@apache.org>
Subject [jira] [Updated] (SPARK-31007) KMeans optimization based on triangle-inequality
Date Mon, 02 Mar 2020 10:07:00 GMT
```
[ https://issues.apache.org/jira/browse/SPARK-31007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

zhengruifeng updated SPARK-31007:
---------------------------------
Description:
In current impl, following Lemma is used in KMeans:

0, Let x be a point, let b be a center and o be the origin, then d(x,c) >= |(d(x,o) - d(c,o))|
= |norm(x)-norm(c)|

this can be applied in {{EuclideanDistance}}, but not in {{CosineDistance}}

According to [Using the Triangle Inequality to Accelerate K-Means|[https://www.aaai.org/Papers/ICML/2003/ICML03-022.pdf]]
we can go futher, and there are another two Lemmas can be used:

1, Let x be a point, and let b and c be centers. If d(b,c)>=2d(x,b) then d(x,c) >= d(x,b);

this can be applied in {{EuclideanDistance}}, but not in {{CosineDistance}}.

However, luckily for CosineDistance we can get a variant in the space of radian/angle.

2, Let x be a point, and let b and c be centers. Then d(x,c) >= max\{0, d(x,b)-d(b,c)};

this can be applied in {{EuclideanDistance}}, but not in {{CosineDistance}}

The application of Lemma 2 is a little complex: It need to cache/update the distance/lower
bounds to previous centers, and thus can be only applied in training, not usable in predction.

So this ticket is mainly for Lemma 1. Its idea is quite simple, if point x is close to center
b enough (less than a pre-computed radius), then we can say point x belong to center c without
computing the distances between x and other centers. It can be used in both training and predction.

was:
In current impl, following Lemma is used in KMeans:

0, Let x be a point, let b be a center and o be the origin, then d(x,c) >= |(d(x,o) - d(c,o))|
= |norm(x)-norm(c)|

this can be applied in \{{EuclideanDistance}}, but not in \{{CosineDistance}}

According to [Using the Triangle Inequality to Accelerate K-Means|[https://www.aaai.org/Papers/ICML/2003/ICML03-022.pdf],]
we can go futher, and there are another two Lemmas can be used:

1, Let x be a point, and let b and c be centers. If d(b,c)>=2d(x,b) then d(x,c) >= d(x,b);

this can be applied in \{{EuclideanDistance}}, but not in \{{CosineDistance}}.

However, luckily for CosineDistance we can get a variant in the space of radian/angle.

2, Let x be a point, and let b and c be centers. Then d(x,c) >= max\{0, d(x,b)-d(b,c)};

this can be applied in \{{EuclideanDistance}}, but not in \{{CosineDistance}}

The application of Lemma 2 is a little complex: It need to cache/update the distance/lower
bounds to previous centers, and thus can be only applied in training, not usable in predction.

So this ticket is mainly for Lemma 1. Its idea is quite simple, if point x is close to center
b enough (less than a pre-computed radius), then we can say point x belong to center c without
computing the distances between x and other centers. It can be used in both training and predction.

> KMeans optimization based on triangle-inequality
> ------------------------------------------------
>
>                 Key: SPARK-31007
>                 URL: https://issues.apache.org/jira/browse/SPARK-31007
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>    Affects Versions: 3.1.0
>            Reporter: zhengruifeng
>            Assignee: zhengruifeng
>            Priority: Major
>         Attachments: ICML03-022.pdf
>
>
> In current impl, following Lemma is used in KMeans:
> 0, Let x be a point, let b be a center and o be the origin, then d(x,c) >= |(d(x,o)
- d(c,o))| = |norm(x)-norm(c)|
> this can be applied in {{EuclideanDistance}}, but not in {{CosineDistance}}
> According to [Using the Triangle Inequality to Accelerate K-Means|[https://www.aaai.org/Papers/ICML/2003/ICML03-022.pdf]]
we can go futher, and there are another two Lemmas can be used:
> 1, Let x be a point, and let b and c be centers. If d(b,c)>=2d(x,b) then d(x,c) >=
d(x,b);
> this can be applied in {{EuclideanDistance}}, but not in {{CosineDistance}}.
> However, luckily for CosineDistance we can get a variant in the space of radian/angle.
> 2, Let x be a point, and let b and c be centers. Then d(x,c) >= max\{0, d(x,b)-d(b,c)};
> this can be applied in {{EuclideanDistance}}, but not in {{CosineDistance}}
> The application of Lemma 2 is a little complex: It need to cache/update the distance/lower
bounds to previous centers, and thus can be only applied in training, not usable in predction.
> So this ticket is mainly for Lemma 1. Its idea is quite simple, if point x is close to
center b enough (less than a pre-computed radius), then we can say point x belong to center
c without computing the distances between x and other centers. It can be used in both training
and predction.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

```
Mime
View raw message