mahout-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Gustavo Enrique Salazar Torres <tavo...@gmail.com>
Subject Re: Elkan triangle optimization for K-Means
Date Fri, 25 Mar 2011 14:12:40 GMT
 Hi Ted

I finally got an stable version of this algorithm, although it's still a
sequential version.
I was thinking about your suggestion to use a side file, I will have to test
that option.

Regarding storage issues, Elkan's algorithm does not use n^2 to store all
distances, it just uses n*k since it has to remember lower bounds for
distances between every point and centroid.

I also wanted to propose this algorithm as a GSOC project, what do you
think?

Thanks!
Gustavo

On 03/06/2011 08:04 PM, Ted Dunning wrote:

In any case, can't you have the mappers write the distances you want to a
side file with some indexing information based on the offset in the original
data file?  This would allow subsequent maps to read from those side files
and perform a map-side join since the cached data would (a) definitely be in
the same order as the original map file and (b) probably even be on the same
machine.

 The serious question is whether this will actually help or not.  Typically
with really large computations, the key speed factor is total amount of I/O.
 This optimization will speed up the computations, but without storing the
data in memory, the read bandwidth may be low enough that increasing the
amount of data you read could dominate the compute savings.

On Sun, Mar 6, 2011 at 2:58 PM, Ted Dunning <ted.dunning@gmail.com> wrote:

> Any sort of complete distance matrix makes an algorithm unscalable because
> storage size is n^2 in the number of items being clustered.
>
>  Is this the paper you are implicitly referring to?
>
>  http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.14.8422
>
>  Do you really need to keep distances to all centers?  Or just the
> distance to the closest center and the distances between centers?
>
> On Sun, Mar 6, 2011 at 1:39 PM, Gustavo Enrique Salazar Torres <
> gsalazar@ime.usp.br> wrote:
>
>> Hi there
>>
>> I have implemented a version of K-Means using Elkan's triangle
>> optimization
>> but I'm facing
>> OutOfMemory errors since it stores distances between all points and its
>> clusters.
>> Is there any way to distribute this matrix over a set of Hadoop nodes?
>>
>> Thanks.
>>
>> --
>> Gustavo Salazar Torres, Phd(C).
>> ---------------------------------
>> "En la sencillez de mi corazon te he dado todo con alegría" Mons. Luigi
>> Giussani
>>
>> "When describing your own work, be humble and don’t use superlatives of
>> praise, either
>> explicitly or implicitly, even if you are enthusiastic" Donald Knuth
>>
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message