mahout-user mailing list archives

From Ted Dunning <>
Subject Re: Streaming kmeans question
Date Mon, 28 Jul 2014 19:45:34 GMT

I am traveling and it is difficult to get a real internet connection. 

Here is an answer to one of your questions.

For very high-dimensional data, some kind of dimensionality reduction is usually important. The streaming
k-means code does this by using a random projection to approximate the search for the nearest centroid.
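To make the random projection idea concrete, here is a rough sketch in Python. It is not the Mahout code; the class name, parameters and defaults are made up for illustration. Each centroid is projected onto a few random directions, the projected values are kept sorted, and a query only computes full distances against the centroids whose projections land near its own.

```python
import bisect
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


class ProjectionIndex:
    """Approximate nearest-centroid search via random 1-D projections.

    Hypothetical illustration only; assumes a non-empty list of centroids
    and search_size >= 1.
    """

    def __init__(self, centroids, num_projections=3, search_size=4, seed=0):
        rng = random.Random(seed)
        dim = len(centroids[0])
        self.centroids = centroids
        self.search_size = search_size
        # Random Gaussian directions; each maps a vector to a single scalar.
        self.directions = [
            [rng.gauss(0.0, 1.0) for _ in range(dim)]
            for _ in range(num_projections)
        ]
        # Per direction: (projected value, centroid index), sorted by value.
        self.sorted_projections = [
            sorted((dot(d, c), i) for i, c in enumerate(centroids))
            for d in self.directions
        ]

    def nearest(self, query):
        """Return the index of the (approximately) nearest centroid."""
        candidates = set()
        for d, proj in zip(self.directions, self.sorted_projections):
            # Locate the query's projection among the sorted centroid projections.
            pos = bisect.bisect_left(proj, (dot(d, query), -1))
            lo = max(0, pos - self.search_size)
            hi = min(len(proj), pos + self.search_size)
            candidates.update(i for _, i in proj[lo:hi])
        # Full-dimensional distances are computed only for the candidates.
        return min(
            candidates,
            key=lambda i: squared_distance(query, self.centroids[i]),
        )
```

The answer is approximate: the true nearest centroid can be missed if it does not fall inside the search window on any of the projections. More projections or a wider window trade speed for accuracy.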

Note that the output of the streaming step is *not* a set of initial centroids. Instead it
is a large number of centroids which are then clustered as a surrogate for the original data.
These centroids are far fewer than the original data points, so the final ball k-means step can
run in memory. This is very different from the canopy approach.
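As a simplified sketch of what the streaming step produces (again in Python, and again not the Mahout implementation; the function names, the cutoff rule and the defaults are assumptions for illustration), each point either becomes a new weighted centroid or is merged into its nearest one. When the sketch grows too large, the cutoff is loosened and the centroids themselves are re-clustered.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def fold_in(sketch, vec, weight, cutoff, rng):
    """Add one weighted vector to the sketch: new centroid or merge."""
    if not sketch:
        sketch.append([vec, weight])
        return
    nearest = min(sketch, key=lambda e: squared_distance(vec, e[0]))
    d2 = squared_distance(vec, nearest[0])
    # Far-away points are likely to start a new centroid (assumed rule).
    if rng.random() < min(1.0, weight * d2 / cutoff):
        sketch.append([vec, weight])
    else:
        # Otherwise merge into the nearest centroid as a weighted mean.
        total = nearest[1] + weight
        nearest[0] = [
            (nearest[1] * a + weight * b) / total
            for a, b in zip(nearest[0], vec)
        ]
        nearest[1] = total


def streaming_sketch(points, max_centroids, cutoff=1.0, growth=1.5, seed=0):
    """One pass over points; returns (centroid, weight) pairs.

    Assumes max_centroids >= 1.
    """
    rng = random.Random(seed)
    sketch = []  # entries are [centroid_vector, weight]
    for p in points:
        fold_in(sketch, list(p), 1.0, cutoff, rng)
        # Too many centroids: loosen the cutoff and re-cluster the sketch.
        while len(sketch) > max_centroids:
            cutoff *= growth
            old, sketch = sketch, []
            for c, w in old:
                fold_in(sketch, c, w, cutoff, rng)
    return [(c, w) for c, w in sketch]
```

The returned weighted centroids stand in for the original points. A final in-memory clustering step (ball k-means in Mahout's case) would then run on them with the weights taken into account.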

There is a known issue with the map-reduce version of the streaming k-means program that causes
the number of centroids output by the parallel part of the algorithm to be too large. 


Sent from my iPhone

> On Jul 28, 2014, at 3:08, Bojan Kostić <> wrote:
> Also, as I see it, streaming k-means is for large sets of data. Does "large"
> mean a large number of points and not dimensions? And what to do when the data
> has many dimensions? Like more than 1000000 dimensions.
