Dmitriy is correct in that the Streaming KMeans in MlLib is a wrong name
for something that was meant to convey "Spark Streaming + KMeans".
The Mahout Streaming KMeans is an implementation of the Meyerson paper
that's been referred to in Dmitriy's email.
I have had folks wrongly misconstrue Streaming KMeans as being Spark
Streaming + KMeans, thanks to the bad naming on part of the MlLib folks.
I had spoken to Jeremy Freeman, the MLLib Streaming KMEans contributor
about this and he agrees that the intent was to convery "Spark Streaming +
KMeans", he definitely wasn't aware of the Streaming KMeans algorithm that
existed much before Spark Streaming.
On Tue, Jun 16, 2015 at 5:34 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
> "streaming kmeans" is something else afaik. Streaming kmeans is reserved
> for a particular kmeans method (in Mahout, at least, [1]).
>
> Whereas as far as i understand what mllib calls "streaming kmeans" is name
> given by mllib contributor which really means "online kmeans", i.e. radar
> tracking of centroids over time over stream of (x_i, t_i) pairs and that
> uses Spark streaming, but has nothing to do with Shindler et. al. method.
>
> At least that was our understanding last time we looked at the issue of the
> names here.
>
> So this issue has come several times already when people come and say,
> "what? streaming kmeans? mllib has streaming kmeans" whereas everybody
> else is talking about something else.
>
> [1]
>
> http://papers.nips.cc/paper/4362fastandaccuratekmeansforlargedatasets.pdf
>
> On Tue, Jun 16, 2015 at 2:04 PM, RJ Nowling <rnowling@gmail.com> wrote:
>
> > There is a streaming kmeans implementation in MLlib that uses reservoir
> > sampling.
> >
> > On Tue, Jun 2, 2015 at 2:24 AM, Marko Dinic <marko.dinic@nissatech.com>
> > wrote:
> >
> > > Ted,
> > >
> > > Thank you for your answer.
> > >
> > > What would you then recommend me to do? My idea is to implement it to
> > > enable clustering of time series using DTW (Dynamic Time Warping) as
> > > distance measure. As you know, the main problem is that Kmedoids is
> not
> > > scalable, so that's standing on my way. Of course, it could be used
> with
> > > other distances as well.
> > >
> > > I have already implemented something that I consider a scalable
> > Kmedoids,
> > > based on using pivots to speed up medoid selection (
> > > https://seer.lcc.ufmg.br/index.php/jidm/article/viewFile/99/82). This
> > > works for distance measures such as Euclidean, has some limitations
> (best
> > > results are in case of normal distribution, outliers could be a
> problem),
> > > but it works pretty good (considering the computations). The thing is,
> it
> > > can't be used with DTW, since it relies on projections, while triangle
> > > inequality for DTW does not hold. That is why I'm considering this
> > > Streaming approach now.
> > >
> > > Would you think that it is worthy of giving a shot? I'm really
> stretching
> > > for a scalable solution.
> > >
> > > Best regards,
> > > Marko
> > >
> > >
> > > On Tue 02 Jun 2015 12:03:40 AM CEST, Ted Dunning wrote:
> > >
> > >> The streaming kmeans works by building a sketch of the data which is
> > then
> > >> used to do real clustering.
> > >>
> > >> It might be that this sketch would be acceptable to do kmedoids, but
> > that
> > >> is definitely not guaranteed.
> > >>
> > >> Similarly, it might be possible to build a medoid sketch instead of a
> > mean
> > >> based sketch, but this is also unexplored ground.
> > >>
> > >> The virtue of the first approach (using a mmeans sketch as input to
> > >> kmedoids) would be that it would make the kmedoids scalable.
> > >>
> > >>
> > >>
> > >> On Mon, Jun 1, 2015 at 1:04 PM, Marko Dinic <
> marko.dinic@nissatech.com>
> > >> wrote:
> > >>
> > >> Hello everyone,
> > >>>
> > >>> I have an idea and I would like to get a validation from community
> > about
> > >>> it.
> > >>>
> > >>> In Mahout there is an implementation of Streaming Kmeans. I'm
> > interested
> > >>> in your opinion would it make sense to make a similar implementation
> of
> > >>> Streaming Kmedoids?
> > >>>
> > >>> Kmedoids has even bigger problems than Kmeans because it's not
> > >>> scalable,
> > >>> but can be useful in some cases (e.g. It allows more sophisticated
> > >>> distance
> > >>> measures).
> > >>>
> > >>> What is your opinion about implementation of this?
> > >>>
> > >>> Best regards,
> > >>> Marko
> > >>>
> > >>>
> > >>
> >
>
