The Elkan optimization doesn't speed up the distance computation. It avoids
it entirely.
The trick is that if a centroid has moved by a distance of at most epsilon,
then, by the triangle inequality, the distance from any point to that
centroid can have increased or decreased by at most epsilon. That means the
previous distance to the closest centroid, plus epsilon, is a bound that
another centroid must have some hope of beating before computing the exact
distance to it is worth doing. We can also bound point-to-centroid distances
using the inter-centroid distances. The catch is that remembering old
centroid steps (we have to remember lots of old steps so that we can compute
bounds against old centroids) and remembering old distances to the different
clusters increases the amount of I/O we have to do. For very sparse inputs,
that I/O could dominate the overall speed. If we don't keep the distances to
all centroids for all points, we can still bound some of the distances and
avoid some distance computations, but the bounds are weaker.
Having fast distance computations is still a great idea. It is just
orthogonal to the question of whether the Elkan trick helps.
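To make the bound maintenance concrete, here is a minimal sketch in Python
(not Mahout code; the function and variable names are hypothetical). It keeps
an upper bound on the distance to the assigned centroid and lower bounds on
the distances to the others, loosens them by each centroid's movement, and
skips any centroid whose lower bound already exceeds the upper bound:

```python
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def assign_with_bounds(point, centroids, shifts, upper, lowers, assigned):
    """One reassignment step for a single point, using Elkan-style bounds.

    upper   -- upper bound on dist(point, centroids[assigned])
    lowers  -- lower bounds on dist(point, centroids[c]) for each c
    shifts  -- how far each centroid moved in the last update (the epsilon)
    """
    # Loosen the bounds by each centroid's movement.
    upper += shifts[assigned]
    lowers = [l - s for l, s in zip(lowers, shifts)]

    tight = False   # has `upper` been replaced by the exact distance yet?
    skipped = 0     # distance computations avoided
    for c in range(len(centroids)):
        if c == assigned:
            continue
        if lowers[c] >= upper:
            # Centroid c has no hope of beating the current assignment,
            # so we never compute the distance to it.
            skipped += 1
            continue
        if not tight:
            # Tighten the upper bound once with the exact distance,
            # then re-check the pruning condition.
            upper = dist(point, centroids[assigned])
            tight = True
            if lowers[c] >= upper:
                skipped += 1
                continue
        d = dist(point, centroids[c])
        lowers[c] = d  # exact distance is the tightest lower bound
        if d < upper:
            assigned, upper = c, d
    return assigned, upper, lowers, skipped
```

For example, a point sitting right next to its centroid, with a second
centroid far away and only small centroid movements, never needs the second
distance computed at all: the loosened lower bound alone rules it out.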
On Sat, Mar 26, 2011 at 6:31 AM, Robin Anil <robin.anil@gmail.com> wrote:
> Gustavo, there is a VectorBenchmark class checked in. In case someone wants
> to do some experimentation with distance computation using the Elkan
> optimization on some synthetic data, I would suggest adding a decently
> modelled benchmark to that class; you can see bottlenecks and performance
> problems clearly. That should help you decide whether the optimization is
> good for Mahout or not, instead of discussing it here. Code always talks
> better :)
>
