samoa-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Albert Bifet <>
Subject Re: Spark 1.2 adds streaming k-means
Date Wed, 04 Feb 2015 00:55:05 GMT
There is a lot of work in stream clustering from two communities, the
streaming algorithmic community, and the machine learning/data stream
mining community.

There is a nice survey of these new methods in:

In Apache SAMOA, CluStream is implemented as one of the first
state-of-the-art methods:

and it could be nice to have more methods implemented, as for example the
ones implemented in MOA.

On Mon, Feb 2, 2015 at 7:07 AM, Ted Dunning <> wrote:

> That isn't streaming k-means in the Mahout sense.  What they have done is
> implement a very basic sort of exponential smoothing to the normal k-means
> algorithm so that only recent points contribute significantly to centroid
> location.  This assumes an initial high quality cluster and probably also
> depends on small changes in the underlying data distribution.  It doesn't
> solve the multi-start problem in high dimensions.
> The Mahout algorithm is a bit different.  The idea is that you want to do a
> single pass high quality clustering of a lot of data.  This is hard to do
> with traditional k-means, both because k-means normally requires multiple
> passes through the data to get good centroids and also because multiple
> restarts are required to get good results.  A streaming solution should
> also be able to give you an accurate clustering at any point in time with
> roughly unit-ish cost.  All these problems are solved with the Mahout
> solution.  The current problems with the Mahout solution have to do with
> the fact that the map-reduce solution has poor scaling properties due to
> the non-trivial size of the cluster sketches.
> On Thu, Jan 29, 2015 at 7:24 AM, Gianmarco De Francisci Morales <
>> wrote:
> > Seems they started to play with streaming algorithms also in Spark and
> > MLlib.
> >
> >
> >
> > I wonder how much the mini-batch programming model they have fits
> > traditional streaming algorithms.
> > Also, I guess the concept of state across the stream does not fit very
> well
> > the abstraction of RDDs.
> >
> > Interesting to read nevertheless.
> >
> > Cheers,
> > --
> > Gianmarco
> >

  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message