Daniel
I guess Ted is right; the paper below suggests that Elkan's
optimization is better than kd-trees:
http://www.siam.org/proceedings/datamining/2010/dm10_012_hamerlyg.pdf
Do you know of any benchmark of ball trees vs. Elkan's optimization?
Ted, perhaps this optimized version of k-means should be used for
high-dimensional data, where distance computation will never be fast. What
do you think?
Still, I will perform some tests.
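For reference, the skip test at the heart of Elkan's optimization is
roughly this (a sketch with my own naming, not Mahout code):

    class ElkanSkipSketch {
        // A point keeps an upper bound u(x) on the distance to its assigned
        // centroid c(x), and a lower bound l(x, c) for each other centroid c.
        static boolean canSkip(double upper, double lower,
                               double halfInterCentroidDist) {
            // Triangle inequality: c cannot beat c(x) if u(x) <= l(x, c)
            // or u(x) <= d(c(x), c) / 2, so the exact distance is never
            // computed for that point/centroid pair.
            return upper <= lower || upper <= halfInterCentroidDist;
        }
    }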
Best regards.
Gustavo
On Fri, Mar 25, 2011 at 5:47 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
> Again, the problem with large data is almost always I/O, not compute.
>
> Let's take an example with 100 centroids. Assume inputs are sparse vectors
> with 100 non-zero elements on average.
>
> The complete distance computation costs 100 floating point operations per
> vector. Sparse vector routines can usually sustain a few hundred megaflops
> single-threaded and possibly a gigaflop with lots of threads. Comparison
> against our 100 centroids thus takes 10K flops = 10 us on average. A
> vector is, however, 200 x (8 + 4) > 2 Kbytes long. At 100 MB/s, this takes
> 20 us to read, which means that the program will be I/O-limited for sure
> on non-local data (where Hadoop has trouble saturating a 100 MB/s link).
> For local data, Hadoop can sometimes sustain enough I/O to saturate the
> CPU, but just barely.
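
A quick throwaway program to check the arithmetic above; the constants are
the estimates from that paragraph, not measurements:

    // Back-of-envelope check of the compute-vs-I/O estimate quoted above.
    public class CostCheck {
        public static void main(String[] args) {
            double flopsPerVector = 100 * 100;     // 100 flops x 100 centroids
            double flopRate = 1e9;                 // ~1 gigaflop with many threads
            double computeSeconds = flopsPerVector / flopRate;

            double bytesPerVector = 200 * (8 + 4); // index + value per element
            double readRate = 100e6;               // 100 MB/s
            double readSeconds = bytesPerVector / readRate;

            System.out.printf("compute: %.1f us, read: %.1f us%n",
                              computeSeconds * 1e6, readSeconds * 1e6);
            // Prints roughly: compute: 10.0 us, read: 24.0 us -> I/O bound.
        }
    }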
>
> If you are running on EC2, these factors will be much worse and there will
> be almost no chance of saturating the CPU.
>
> Being very clever about the distance computation may cut the cost by a
> small factor, but this is unlikely to make even a 2:1 difference in the
> cost of the computation. Even so, the cost of running each k-means
> iteration will be swamped by Hadoop overheads.
>
> On the other hand, building a non-Hadoop version of k-means using
> something like S4, Twister, Spark, or HaLoop would allow speedups of as
> much as 20x simply by making the iterations much more efficient.
> Committing to MapReduce v2 would allow similar speedups. Another way to
> speed things up would be to allow multiple k-means iterations to happen in
> one pass over the data by implementing a gossip protocol that lets mappers
> swap updates to centroids during a data pass. On large data, full passes
> for each iteration are not very helpful, especially at the beginning of
> the process.
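
That last idea seems to hinge on centroid updates merging cleanly in any
order; a sketch of the commutative accumulator it would need (my naming,
nothing in Mahout):

    // Per-mapper partial state for k-means: coordinate sums and counts per
    // centroid. Because merge() is associative and commutative, mappers
    // could gossip and combine these deltas mid-pass in any order.
    class CentroidDelta {
        final double[][] sums;  // per-centroid coordinate sums
        final long[] counts;    // per-centroid point counts

        CentroidDelta(int k, int dim) {
            sums = new double[k][dim];
            counts = new long[k];
        }

        void add(int centroid, double[] point) {
            for (int d = 0; d < point.length; d++) {
                sums[centroid][d] += point[d];
            }
            counts[centroid]++;
        }

        void merge(CentroidDelta other) {
            for (int c = 0; c < counts.length; c++) {
                counts[c] += other.counts[c];
                for (int d = 0; d < sums[c].length; d++) {
                    sums[c][d] += other.sums[c][d];
                }
            }
        }
    }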
>
> On Fri, Mar 25, 2011 at 12:03 PM, Daniel McEnnis <dmcennis@gmail.com>
> wrote:
>
> > Dear Ted, Gustavo,
> >
> > If you don't mind me eavesdropping... The standard way to cut down on
> > distance computations for a number of algorithms, including k-means, is
> > to use ball trees. Mapping these data structures into a MapReduce
> > framework will speed things up considerably.
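
For reference, the ball-tree prune Daniel describes comes down to something
like this (an illustrative sketch, not Mahout code):

    import java.util.List;

    // Minimal ball-tree nearest-neighbor search (sketch). The prune: if the
    // query is at least bestSoFar + radius away from a ball's center, nothing
    // inside the ball can improve on bestSoFar, so the subtree is skipped.
    class BallTreeSketch {
        static class Node {
            double[] center;
            double radius;
            Node left, right;      // null for leaves
            List<double[]> points; // set only on leaves
        }

        static double dist(double[] a, double[] b) {
            double s = 0;
            for (int i = 0; i < a.length; i++) {
                double d = a[i] - b[i];
                s += d * d;
            }
            return Math.sqrt(s);
        }

        static double nearest(Node n, double[] q, double bestSoFar) {
            if (dist(q, n.center) - n.radius >= bestSoFar) {
                return bestSoFar;   // triangle inequality: prune whole subtree
            }
            if (n.points != null) { // leaf: scan the points it holds
                for (double[] p : n.points) {
                    bestSoFar = Math.min(bestSoFar, dist(q, p));
                }
                return bestSoFar;
            }
            bestSoFar = nearest(n.left, q, bestSoFar);
            return nearest(n.right, q, bestSoFar);
        }
    }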
> >
> > Sincerely,
> >
> > Daniel McEnnis.
> >
> > On Fri, Mar 25, 2011 at 12:53 PM, Ted Dunning <ted.dunning@gmail.com>
> > wrote:
> > > I think that this might make a good GSoC project.
> > >
> > > But you should do a bit more analysis to determine whether this will
> > > actually speed things up. There is a comment in the Elkan paper that if
> > > the distance computation is fast, there may not be any speedup.
> > > Likewise, there may be no speedup if the algorithm is I/O-bound.
> > >
> > >
> > >
> > > On Fri, Mar 25, 2011 at 7:12 AM, Gustavo Enrique Salazar Torres <
> > > tavoaqp@gmail.com> wrote:
> > >
> > >> Hi Ted
> > >>
> > >> I finally got a stable version of this algorithm, although it's still
> > >> sequential.
> > >> I was thinking about your suggestion to use a side file; I will have
> > >> to test that option.
> > >>
> > >> Regarding storage issues, Elkan's algorithm does not use n^2 storage
> > >> for all distances; it uses just n*k, since it has to remember a lower
> > >> bound on the distance between every point and every centroid.
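
Concretely, the bookkeeping is a fragment like this (n points, k clusters):

    // Elkan bookkeeping: O(n*k) rather than an O(n^2) distance matrix.
    double[][] lower   = new double[n][k]; // lower bounds: point i vs. centroid j
    double[]   upper   = new double[n];    // upper bound: point i vs. its assigned centroid
    double[][] centers = new double[k][k]; // pairwise centroid distances, redone each iteration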
> > >>
> > >> I also wanted to propose this algorithm as a GSoC project. What do
> > >> you think?
> > >>
> > >> Thanks!
> > >> Gustavo
> > >>
> > >> On 03/06/2011 08:04 PM, Ted Dunning wrote:
> > >>
> > >> In any case, can't you have the mappers write the distances you want
> > >> to a side file with some indexing information based on the offset in
> > >> the original data file? This would allow subsequent maps to read from
> > >> those side files and perform a map-side join, since the cached data
> > >> would (a) definitely be in the same order as the original map file and
> > >> (b) probably even be on the same machine.
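
The side file itself could be as simple as offset-indexed records; a sketch
in plain java.io (a real job would presumably use Hadoop writables instead):

    import java.io.DataOutputStream;
    import java.io.IOException;

    // Sketch of a side-file record: the offset of the point in the original
    // input file plus whatever per-point bounds the next pass needs. Reading
    // the file back in order gives a map-side join against the original data.
    class SideFileSketch {
        static void writeRecord(DataOutputStream out, long offset,
                                double[] bounds) throws IOException {
            out.writeLong(offset);      // position in the original data file
            out.writeInt(bounds.length);
            for (double b : bounds) {
                out.writeDouble(b);
            }
        }
    }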
> > >>
> > >> The serious question is whether this will actually help or not.
> > >> Typically, with really large computations, the key speed factor is the
> > >> total amount of I/O. This optimization will speed up the computations,
> > >> but without storing the data in memory, the read bandwidth may be low
> > >> enough that increasing the amount of data you read could dominate the
> > >> compute savings.
> > >>
> > >> On Sun, Mar 6, 2011 at 2:58 PM, Ted Dunning <ted.dunning@gmail.com>
> > >> wrote:
> > >>
> > >>> Any sort of complete distance matrix makes an algorithm unscalable,
> > >>> because storage size is n^2 in the number of items being clustered.
> > >>>
> > >>> Is this the paper you are implicitly referring to?
> > >>>
> > >>> http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.14.8422
> > >>>
> > >>> Do you really need to keep distances to all centers? Or just the
> > >>> distance to the closest center and the distances between centers?
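
That reduced bookkeeping, as in the Hamerly variant from the paper linked
above, would shrink the per-point state to O(n); a fragment (n points, k
centers):

    // Per-point state when only the closest center is tracked:
    double[] upper = new double[n]; // upper bound: point vs. its assigned center
    double[] lower = new double[n]; // lower bound: point vs. second-closest center
    double[] s     = new double[k]; // half distance from each center to its nearest other center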
> > >>>
> > >>> On Sun, Mar 6, 2011 at 1:39 PM, Gustavo Enrique Salazar Torres <
> > >>> gsalazar@ime.usp.br> wrote:
> > >>>
> > >>>> Hi there
> > >>>>
> > >>>> I have implemented a version of k-means using Elkan's triangle
> > >>>> inequality optimization, but I'm facing OutOfMemory errors since it
> > >>>> stores the distances between all points and their clusters.
> > >>>> Is there any way to distribute this matrix over a set of Hadoop
> > >>>> nodes?
> > >>>>
> > >>>> Thanks.
> > >>>>
> > >>>> 
> > >>>> Gustavo Salazar Torres, Ph.D. (c)
> > >>>> 
> > >>>> "En la sencillez de mi corazon te he dado todo con alegría" Mons.
> > Luigi
> > >>>> Giussani
> > >>>>
> > >>>> "When describing your own work, be humble and don’t use superlatives
> > of
> > >>>> praise, either
> > >>>> explicitly or implicitly, even if you are enthusiastic" Donald
Knuth
> > >>>>
> > >>>
> > >>>
> > >>
> > >>
> > >
> >
>
