mahout-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Lance Norskog <>
Subject Re: Elkan triangle optimization for K-Means
Date Mon, 07 Mar 2011 08:13:11 GMT
This talks about how to make an incomplete distance matrix:

On Sun, Mar 6, 2011 at 3:04 PM, Ted Dunning <> wrote:
> In any case, can't you have the mappers write the distances you want to a
> side file with some indexing information based on the offset in the original
> data file?  This would allow subsequent maps to read from those side files
> and perform a map-side join since the cached data would (a) definitely be in
> the same order as the original map file and (b) probably even be on the same
> machine.
> The serious question is whether this will actually help or not.  Typically
> with really large computations, the key speed factor is total amount of I/O.
>  This optimization will speed up the computations, but without storing the
> data in memory, the read bandwidth may be low enough that increasing the
> amount of data you read could dominate the compute savings.
> On Sun, Mar 6, 2011 at 2:58 PM, Ted Dunning <> wrote:
>> Any sort of complete distance matrix makes an algorithm unscalable because
>> storage size is n^2 in the number of items being clustered.
>> Is this the paper you are implicitly referring to?
>> <>Do you
>> really need to keep distances to all centers?  Or just the distance to the
>> closest center and the distances between centers?
>> On Sun, Mar 6, 2011 at 1:39 PM, Gustavo Enrique Salazar Torres <
>>> wrote:
>>> Hi there
>>> I have implemented a version of K-Means using Elkan's triangle
>>> optimization
>>> but I'm facing
>>> OutOfMemory errors since it stores distances between all points and its
>>> clusters.
>>> Is there any way to distribute this matrix over a set of Hadoop nodes?
>>> Thanks.
>>> --
>>> Gustavo Salazar Torres, Phd(C).
>>> UOL Analyst
>>> ---------------------------------
>>> "En la sencillez de mi corazon te he dado todo con alegría" Mons. Luigi
>>> Giussani
>>> "When describing your own work, be humble and don’t use superlatives of
>>> praise, either
>>> explicitly or implicitly, even if you are enthusiastic" Donald Knuth

Lance Norskog

View raw message