mahout-user mailing list archives

From Gustavo Enrique Salazar Torres <gsala...@ime.usp.br>
Subject Re: Elkan triangle optimization for K-Means
Date Fri, 25 Mar 2011 22:22:14 GMT
Daniel

I guess Ted is right; the paper below suggests that Elkan's
optimization is better than kd-trees:

http://www.siam.org/proceedings/datamining/2010/dm10_012_hamerlyg.pdf

Do you know of any benchmark of ball trees vs. Elkan's optimization?

Ted, perhaps this optimized version of k-means should be used for
high-dimensional data, where distance computation will never be fast. What
do you think?
Still, I will perform some tests.

Best regards.
Gustavo
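
P.S. For anyone following along, the heart of Elkan's optimization is a
triangle-inequality test that skips most point-to-centroid distance
computations. A minimal sketch in Java (names are illustrative, this is
not Mahout code):

// Elkan's pruning rule: by the triangle inequality, if
// d(best, c') >= 2 * d(x, best), then d(x, c') >= d(x, best),
// so the exact distance d(x, c') never has to be computed.
static int assign(double[] x, double[][] centroids,
                  double[][] centerDist) { // centerDist[i][j] = d(c_i, c_j)
  int best = 0;
  double bestDist = euclidean(x, centroids[0]);
  for (int j = 1; j < centroids.length; j++) {
    if (centerDist[best][j] >= 2 * bestDist) {
      continue; // centroid j cannot beat the current best
    }
    double d = euclidean(x, centroids[j]);
    if (d < bestDist) {
      bestDist = d;
      best = j;
    }
  }
  return best;
}

static double euclidean(double[] a, double[] b) {
  double sum = 0;
  for (int i = 0; i < a.length; i++) {
    double diff = a[i] - b[i];
    sum += diff * diff;
  }
  return Math.sqrt(sum);
}

The full algorithm additionally keeps per-point upper and lower bounds, so
many of the surviving distance computations can be skipped across
iterations as well.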
On Fri, Mar 25, 2011 at 5:47 PM, Ted Dunning <ted.dunning@gmail.com> wrote:

> Again, the problem with large data is almost always I/O, not compute.
>
> Let's take an example with 100 centroids.  Assume inputs are sparse vectors
> with 100 non-zero elements on average.
>
> The full distance computation then costs roughly 100 floating point
> operations per centroid per vector.  Sparse vector routines can usually
> sustain a few hundred megaflops single-threaded and possibly a gigaflop
> with lots of threads.  Comparison against our 100 centroids thus takes
> about 10K flops = 10us per vector on average.  Such a vector is, however,
> 200 x (8 + 4) > 2 Kbytes long.  At 100 MB/s, this takes over 20us to read,
> which means that the program will be I/O limited for sure on non-local
> data (where Hadoop has trouble saturating a 100 MB/s link).  For local
> data, Hadoop can sometimes sustain enough I/O to saturate the CPU, but
> just barely.
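
To put rough numbers on the estimate above, the arithmetic can be written
out directly (all figures are the assumptions from the message, not
measurements):

// Back-of-envelope cost model from this thread (assumed figures, not benchmarks).
public class KMeansCostEstimate {
  public static void main(String[] args) {
    int centroids = 100;
    double flopsPerVectorPerCentroid = 100;  // sparse distance computation
    double flopsPerSecond = 1e9;             // optimistic multi-threaded rate
    double vectorBytes = 200 * (8 + 4);      // index + value encoding, ~2.4 KB
    double bytesPerSecond = 100e6;           // 100 MB/s read bandwidth

    double computeUs = centroids * flopsPerVectorPerCentroid / flopsPerSecond * 1e6;
    double ioUs = vectorBytes / bytesPerSecond * 1e6;
    System.out.printf("per vector: compute %.0f us, I/O %.0f us%n", computeUs, ioUs);
    // Prints roughly "compute 10 us, I/O 24 us": the job is I/O bound.
  }
}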
>
> If you are running on EC2, these factors will be much worse and there will
> be almost no chance of saturating the CPU.
>
> Being very clever about the distance computation may cut the cost by a
> small factor, but it is unlikely to make even a 2:1 difference in the cost
> of the computation.  Even so, the cost of running each k-means iteration
> will be swamped by Hadoop overheads.
>
> On the other hand, building a non-Hadoop version of k-means using
> something like S4, Twister, Spark, or HaLoop would allow speedups of as
> much as 20x simply by making each iteration much more efficient.
> Committing to map-reduce v2 would allow similar speedups.  Another way to
> speed things up would be to allow multiple k-means iterations to happen in
> one pass over the data by implementing a gossip protocol that lets mappers
> swap updates to centroids during a data pass.  On large data, full passes
> for each iteration are not very helpful, especially at the beginning of
> the process.
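
The gossip idea is speculative; as a toy, single-process illustration only
(invented names, and a real version would need a network transport),
mappers could publish centroid deltas to a shared channel and fold in
their peers' deltas mid-pass:

import java.util.concurrent.ConcurrentLinkedQueue;

// Toy illustration of gossiped centroid updates. Each "mapper" publishes
// (sum, count) deltas for a centroid and absorbs deltas from peers,
// letting centroids move during a single pass over the data.
class CentroidGossip {
  static class Delta {
    final int centroid;
    final double[] sum;   // sum of points assigned since the last publish
    final long count;
    Delta(int centroid, double[] sum, long count) {
      this.centroid = centroid;
      this.sum = sum;
      this.count = count;
    }
  }

  private final ConcurrentLinkedQueue<Delta> channel =
      new ConcurrentLinkedQueue<Delta>();

  void publish(Delta d) {
    channel.offer(d);
  }

  // Fold pending peer deltas into this mapper's centroid estimates,
  // stored as running means plus the number of points behind each mean.
  void absorb(double[][] centroids, long[] counts) {
    for (Delta d = channel.poll(); d != null; d = channel.poll()) {
      long merged = counts[d.centroid] + d.count;
      for (int i = 0; i < d.sum.length; i++) {
        centroids[d.centroid][i] =
            (centroids[d.centroid][i] * counts[d.centroid] + d.sum[i]) / merged;
      }
      counts[d.centroid] = merged;
    }
  }
}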
>
>
>
>
> On Fri, Mar 25, 2011 at 12:03 PM, Daniel McEnnis <dmcennis@gmail.com>
> wrote:
>
> > Dear Ted, Gustavo,
> >
> > If you don't mind me eavesdropping.... The standard way to cut down on
> > distance computations for a number of algorithms, including KMeans, is
> > to use Ball Trees.  Mapping these data structures into a map-reduce
> > framework will speed things up considerably.
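
For context, ball trees buy their speedup from a pruning argument much like
Elkan's: if a query point is farther from a ball's center than the ball's
radius plus the current best distance, nothing inside the ball can win. A
bare-bones sketch (illustrative, not an existing Mahout class); for k-means
the tree would be built over the centroids:

// Minimal ball-tree node with branch-and-bound nearest-neighbor search.
class BallNode {
  double[] center;       // center of the bounding ball
  double radius;         // every point in this subtree is within radius of center
  double[] point;        // non-null only at leaves
  BallNode left, right;  // children, null at leaves

  // Returns the smallest distance found so far; subtrees whose bounding
  // ball cannot contain anything closer than 'best' are skipped entirely.
  double nearest(double[] query, double best) {
    if (distance(query, center) - radius >= best) {
      return best;  // the whole ball is too far away: prune it
    }
    if (point != null) {
      return Math.min(best, distance(query, point));
    }
    best = left.nearest(query, best);
    return right.nearest(query, best);
  }

  static double distance(double[] a, double[] b) {
    double sum = 0;
    for (int i = 0; i < a.length; i++) {
      double diff = a[i] - b[i];
      sum += diff * diff;
    }
    return Math.sqrt(sum);
  }
}

A search starts as root.nearest(x, Double.POSITIVE_INFINITY); tracking which
centroid produced the best distance is a small extension.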
> >
> > Sincerely,
> >
> > Daniel McEnnis.
> >
> > On Fri, Mar 25, 2011 at 12:53 PM, Ted Dunning <ted.dunning@gmail.com>
> > wrote:
> > > I think that this might make a good GSoC project.
> > >
> > > But you should do a bit more analysis to determine if this will
> > > actually speed things up.  There is a comment in the Elkan paper that
> > > if the distance computation is fast there may not be any speedup.
> > > Likewise, there may be no speedup if the algorithm is I/O bound.
> > >
> > >
> > >
> > > On Fri, Mar 25, 2011 at 7:12 AM, Gustavo Enrique Salazar Torres <
> > > tavoaqp@gmail.com> wrote:
> > >
> > >> Hi Ted
> > >>
> > >> I finally got a stable version of this algorithm, although it's still
> > >> a sequential version.
> > >> I was thinking about your suggestion to use a side file; I will have
> > >> to test that option.
> > >>
> > >> Regarding storage, Elkan's algorithm does not need O(n^2) space to
> > >> store all pairwise distances; it uses O(n*k), since it has to remember
> > >> a lower bound on the distance between every point and each centroid.
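
Concretely, the state that replaces the full distance matrix might look
like this (a sketch; the field names are mine, not from the actual code):

// Elkan's bound state: O(n*k) lower bounds plus O(n) upper bounds,
// instead of an O(n^2) pairwise distance matrix.
class ElkanState {
  final double[][] lower;  // lower[i][j] <= d(point i, centroid j)
  final double[] upper;    // upper[i]   >= d(point i, assigned centroid)
  final int[] assignment;  // index of point i's current centroid

  ElkanState(int n, int k) {
    lower = new double[n][k];  // n = 1M, k = 100 is already ~800 MB of doubles
    upper = new double[n];
    assignment = new int[n];
  }
}

which also suggests why a sequential version runs out of memory well
before n gets large.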
> > >>
> > >> I also wanted to propose this algorithm as a GSoC project; what do
> > >> you think?
> > >>
> > >> Thanks!
> > >> Gustavo
> > >>
> > >> On 03/06/2011 08:04 PM, Ted Dunning wrote:
> > >>
> > >> In any case, can't you have the mappers write the distances you want
> > >> to a side file with some indexing information based on the offset in
> > >> the original data file?  This would allow subsequent maps to read from
> > >> those side files and perform a map-side join, since the cached data
> > >> would (a) definitely be in the same order as the original map file and
> > >> (b) probably even be on the same machine.
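
One possible shape for the side-file write, sketched against the Hadoop
0.20-era SequenceFile API (the path handling, key choice, and value layout
here are assumptions, not existing Mahout code):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.VectorWritable;

// Sketch: write each point's cached distances (or bounds) to a side
// SequenceFile keyed by the point's byte offset in the original input,
// so a later map pass can stream both files in the same order and do a
// map-side join.
class SideFileWriter {
  private final SequenceFile.Writer writer;

  SideFileWriter(Configuration conf, Path path) throws IOException {
    FileSystem fs = FileSystem.get(conf);
    writer = SequenceFile.createWriter(fs, conf, path,
        LongWritable.class, VectorWritable.class);
  }

  void write(long offsetInInput, double[] distances) throws IOException {
    writer.append(new LongWritable(offsetInInput),
        new VectorWritable(new DenseVector(distances)));
  }

  void close() throws IOException {
    writer.close();
  }
}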
> > >>
> > >> The serious question is whether this will actually help or not.
> > >> Typically with really large computations, the key speed factor is the
> > >> total amount of I/O.  This optimization will speed up the computation,
> > >> but without storing the data in memory, the read bandwidth may be low
> > >> enough that increasing the amount of data you read could dominate the
> > >> compute savings.
> > >>
> > >> On Sun, Mar 6, 2011 at 2:58 PM, Ted Dunning <ted.dunning@gmail.com>
> > wrote:
> > >>
> > >>> Any sort of complete distance matrix makes an algorithm unscalable,
> > >>> because storage size is n^2 in the number of items being clustered.
> > >>>
> > >>> Is this the paper you are implicitly referring to?
> > >>>
> > >>> http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.14.8422
> > >>>
> > >>> Do you really need to keep distances to all centers?  Or just the
> > >>> distance to the closest center and the distances between centers?
> > >>>
> > >>> On Sun, Mar 6, 2011 at 1:39 PM, Gustavo Enrique Salazar Torres <
> > >>> gsalazar@ime.usp.br> wrote:
> > >>>
> > >>>> Hi there
> > >>>>
> > >>>> I have implemented a version of K-Means using Elkan's triangle
> > >>>> optimization, but I'm facing OutOfMemory errors since it stores the
> > >>>> distances between all points and their clusters.
> > >>>> Is there any way to distribute this matrix over a set of Hadoop
> > >>>> nodes?
> > >>>>
> > >>>> Thanks.
> > >>>>
> > >>>> --
> > >>>> Gustavo Salazar Torres, Phd(C).
> > >>>> ---------------------------------
> > >>>> "En la sencillez de mi corazon te he dado todo con alegría" Mons.
> > Luigi
> > >>>> Giussani
> > >>>>
> > >>>> "When describing your own work, be humble and don’t use superlatives
> > of
> > >>>> praise, either
> > >>>> explicitly or implicitly, even if you are enthusiastic" Donald
Knuth
> > >>>>
> > >>>
> > >>>
> > >>
> > >>
> > >
> >
>
