mahout-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Lance Norskog <goks...@gmail.com>
Subject Re: Elkan triangle optimization for K-Means
Date Mon, 07 Mar 2011 08:13:11 GMT
This talks about how to make an incomplete distance matrix:

http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.19.7687

On Sun, Mar 6, 2011 at 3:04 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
> In any case, can't you have the mappers write the distances you want to a
> side file with some indexing information based on the offset in the original
> data file?  This would allow subsequent maps to read from those side files
> and perform a map-side join since the cached data would (a) definitely be in
> the same order as the original map file and (b) probably even be on the same
> machine.
>
> The serious question is whether this will actually help or not.  Typically
> with really large computations, the key speed factor is total amount of I/O.
>  This optimization will speed up the computations, but without storing the
> data in memory, the read bandwidth may be low enough that increasing the
> amount of data you read could dominate the compute savings.
>
> On Sun, Mar 6, 2011 at 2:58 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
>
>> Any sort of complete distance matrix makes an algorithm unscalable because
>> storage size is n^2 in the number of items being clustered.
>>
>> Is this the paper you are implicitly referring to?
>>
>> http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.14.8422
>>
>> <http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.14.8422>Do you
>> really need to keep distances to all centers?  Or just the distance to the
>> closest center and the distances between centers?
>>
>> On Sun, Mar 6, 2011 at 1:39 PM, Gustavo Enrique Salazar Torres <
>> gsalazar@ime.usp.br> wrote:
>>
>>> Hi there
>>>
>>> I have implemented a version of K-Means using Elkan's triangle
>>> optimization
>>> but I'm facing
>>> OutOfMemory errors since it stores distances between all points and its
>>> clusters.
>>> Is there any way to distribute this matrix over a set of Hadoop nodes?
>>>
>>> Thanks.
>>>
>>> --
>>> Gustavo Salazar Torres, Phd(C).
>>> UOL Analyst
>>> ---------------------------------
>>> "En la sencillez de mi corazon te he dado todo con alegría" Mons. Luigi
>>> Giussani
>>>
>>> "When describing your own work, be humble and don’t use superlatives of
>>> praise, either
>>> explicitly or implicitly, even if you are enthusiastic" Donald Knuth
>>>
>>
>>
>



-- 
Lance Norskog
goksron@gmail.com

Mime
View raw message