That isn't streaming k-means in the Mahout sense. What they have done is
add a very basic form of exponential smoothing to the normal k-means
algorithm so that only recent points contribute significantly to centroid
location. This assumes a high-quality initial clustering and probably also
depends on the underlying data distribution changing only slowly. It
doesn't solve the multi-start problem in high dimensions.
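The decayed update they describe might look something like this minimal sketch; here alpha is a hypothetical per-point decay weight for illustration, not Spark's actual half-life parameter:

```python
import numpy as np

def decayed_kmeans_update(centroids, batch, alpha=0.1):
    """Illustrative sketch of exponentially-forgetting k-means.

    Each point is assigned to its nearest centroid, which is then pulled
    toward the point with weight alpha, so older points' influence decays
    geometrically.  alpha is an assumed parameter for this sketch.
    """
    centroids = centroids.copy()
    for x in batch:
        # nearest centroid by Euclidean distance
        j = int(np.argmin(np.linalg.norm(centroids - x, axis=1)))
        # exponential smoothing toward the new point
        centroids[j] = (1 - alpha) * centroids[j] + alpha * x
    return centroids
```

Note that this only ever moves the existing centroids, which is why it needs a good starting clustering to begin with.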
The Mahout algorithm is a bit different. The idea is that you want to do a
single-pass, high-quality clustering of a lot of data. This is hard to do
with traditional k-means, both because k-means normally requires multiple
passes through the data to get good centroids and because multiple
restarts are required to get good results. A streaming solution should
also be able to give you an accurate clustering at any point in time at
roughly unit-ish cost. The Mahout solution addresses all of these
problems. Its current weakness is that the map-reduce version has poor
scaling properties due to the non-trivial size of the cluster sketches.
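A minimal sketch of the single-pass idea, assuming a fixed merge radius (the actual Mahout code grows the cutoff adaptively and bounds the sketch size, which this toy version does not):

```python
import numpy as np

def streaming_sketch(points, distance_cutoff=1.0):
    """Illustrative single-pass sketch builder in the spirit of
    streaming k-means.

    Keeps many more weighted centroids than the final k: a point within
    distance_cutoff of an existing centroid merges into it; otherwise it
    opens a new sketch centroid.  distance_cutoff is an assumed fixed
    parameter here, unlike the adaptive cutoff in the real algorithm.
    """
    centroids = []  # list of (vector, weight) pairs
    for x in points:
        if centroids:
            dists = [np.linalg.norm(c - x) for c, _ in centroids]
            j = int(np.argmin(dists))
            if dists[j] < distance_cutoff:
                # merge the point into the nearest sketch centroid
                c, w = centroids[j]
                centroids[j] = ((c * w + x) / (w + 1), w + 1)
                continue
        # too far from everything: start a new sketch centroid
        centroids.append((x.astype(float), 1))
    return centroids
```

At any point in time you can run an ordinary (multi-restart) k-means over the small weighted sketch rather than the raw data, which is what gives the roughly unit-ish cost of producing a clustering on demand.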
On Thu, Jan 29, 2015 at 7:24 AM, Gianmarco De Francisci Morales <
gdfm@apache.org> wrote:
> Seems they started to play with streaming algorithms also in Spark and
> MLlib.
>
> https://databricks.com/blog/2015/01/28/introducing-streaming-k-means-in-spark-1-2.html
>
> I wonder how well the mini-batch programming model they have fits
> traditional streaming algorithms.
> Also, I guess the concept of state across the stream does not fit the
> abstraction of RDDs very well.
>
> Interesting to read nevertheless.
>
> Cheers,
> --
> Gianmarco
>