samoa-dev mailing list archives

From Albert Bifet <abi...@waikato.ac.nz>
Subject Re: Spark 1.2 adds streaming k-means
Date Wed, 04 Feb 2015 00:55:05 GMT
There is a lot of work in stream clustering from two communities, the
streaming algorithmic community, and the machine learning/data stream
mining community.

There is a nice survey of these new methods in:

https://www.researchgate.net/publication/257132178_Data_Stream_Clustering_A_Survey

http://dl.acm.org/citation.cfm?id=2522981

Apache SAMOA implements CluStream, one of the first state-of-the-art
methods:

http://www.vldb.org/conf/2003/papers/S04P02.pdf

It would be nice to have more methods implemented, for example the ones
already available in MOA.


On Mon, Feb 2, 2015 at 7:07 AM, Ted Dunning <tdunning@maprtech.com> wrote:

> That isn't streaming k-means in the Mahout sense.  What they have done is
> implement a very basic sort of exponential smoothing to the normal k-means
> algorithm so that only recent points contribute significantly to centroid
> location.  This assumes an initial high quality cluster and probably also
> depends on small changes in the underlying data distribution.  It doesn't
> solve the multi-start problem in high dimensions.
>
> The Mahout algorithm is a bit different.  The idea is that you want to do a
> single pass high quality clustering of a lot of data.  This is hard to do
> with traditional k-means, both because k-means normally requires multiple
> passes through the data to get good centroids and also because multiple
> restarts are required to get good results.  A streaming solution should
> also be able to give you an accurate clustering at any point in time with
> roughly unit-ish cost.  All these problems are solved with the Mahout
> solution.  The current problems with the Mahout solution have to do with
> the fact that the map-reduce solution has poor scaling properties due to
> the non-trivial size of the cluster sketches.
>
> On Thu, Jan 29, 2015 at 7:24 AM, Gianmarco De Francisci Morales <
> gdfm@apache.org> wrote:
>
> > Seems they started to play with streaming algorithms also in Spark and
> > MLlib.
> >
> >
> > https://databricks.com/blog/2015/01/28/introducing-streaming-k-means-in-spark-1-2.html
> >
> > I wonder how much the mini-batch programming model they have fits
> > traditional streaming algorithms.
> > Also, I guess the concept of state across the stream does not fit the
> > abstraction of RDDs very well.
> >
> > Interesting to read nevertheless.
> >
> > Cheers,
> > --
> > Gianmarco
> >
>
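
To make the exponential-smoothing idea Ted describes above concrete, here is a minimal sketch of a mini-batch k-means update in which older points are discounted. It is an illustration only, not Spark's or Mahout's code: the function name `update_centroids`, the `decay` parameter, and the per-centroid `weights` list are all hypothetical names chosen for this example.

```python
from math import dist


def update_centroids(centroids, weights, batch, decay=0.5):
    """Apply one mini-batch update with exponential forgetting (illustrative).

    centroids: list of k points, each a list of floats
    weights:   list of k effective point counts accumulated so far
    batch:     list of newly arrived points
    decay:     assumed factor in [0, 1]; 1.0 keeps all history,
               0.0 keeps only the current batch
    """
    k = len(centroids)
    dim = len(centroids[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k

    # Assign each new point to its nearest current centroid.
    for point in batch:
        nearest = min(range(k), key=lambda j: dist(point, centroids[j]))
        counts[nearest] += 1
        for d in range(dim):
            sums[nearest][d] += point[d]

    new_centroids, new_weights = [], []
    for j in range(k):
        # Discount the old weight so recent points dominate the average.
        old_weight = weights[j] * decay
        total = old_weight + counts[j]
        if total == 0:
            # No history and no new points: leave the centroid where it is.
            new_centroids.append(list(centroids[j]))
        else:
            new_centroids.append(
                [(centroids[j][d] * old_weight + sums[j][d]) / total
                 for d in range(dim)]
            )
        new_weights.append(total)
    return new_centroids, new_weights
```

Each call only moves the existing centroids, so the result depends on the starting centroids being reasonable, which is the limitation Ted points out.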
