mahout-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Dmitriy Lyubimov <dlie...@gmail.com>
Subject Re: Streaming K-means
Date Tue, 16 Jun 2015 21:34:17 GMT
 "streaming k-means" is something else afaik. Streaming k-means is reserved
for a particular k-means method (in Mahout, at least, [1]).

Whereas as far as i understand what mllib calls "streaming k-means" is name
given by mllib contributor which really means "online k-means", i.e. radar
tracking of centroids over time over stream of (x_i, t_i) pairs and that
uses Spark streaming, but has nothing to do with Shindler et. al. method.

At least that was our understanding last time we looked at the issue of the
names here.

So this issue has come several times already when people come and say,
"what? streaming k-means? mllib has streaming k-means" whereas everybody
else is talking about something else.

[1]
http://papers.nips.cc/paper/4362-fast-and-accurate-k-means-for-large-datasets.pdf

On Tue, Jun 16, 2015 at 2:04 PM, RJ Nowling <rnowling@gmail.com> wrote:

> There is a streaming k-means implementation in MLlib that uses reservoir
> sampling.
>
> On Tue, Jun 2, 2015 at 2:24 AM, Marko Dinic <marko.dinic@nissatech.com>
> wrote:
>
> > Ted,
> >
> > Thank you for your answer.
> >
> > What would you then recommend me to do? My idea is to implement it to
> > enable clustering of time series using DTW (Dynamic Time Warping) as
> > distance measure. As you know, the main problem is that K-medoids is not
> > scalable, so that's standing on my way. Of course, it could be used with
> > other distances as well.
> >
> > I have already implemented something that I consider a scalable
> K-medoids,
> > based on using pivots to speed up medoid selection (
> > https://seer.lcc.ufmg.br/index.php/jidm/article/viewFile/99/82). This
> > works for distance measures such as Euclidean, has some limitations (best
> > results are in case of normal distribution, outliers could be a problem),
> > but it works pretty good (considering the computations). The thing is, it
> > can't be used with DTW, since it relies on projections, while triangle
> > inequality for DTW does not hold. That is why I'm considering this
> > Streaming approach now.
> >
> > Would you think that it is worthy of giving a shot? I'm really stretching
> > for a scalable solution.
> >
> > Best regards,
> > Marko
> >
> >
> > On Tue 02 Jun 2015 12:03:40 AM CEST, Ted Dunning wrote:
> >
> >> The streaming k-means works by building a sketch of the data which is
> then
> >> used to do real clustering.
> >>
> >> It might be that this sketch would be acceptable to do k-medoids, but
> that
> >> is definitely not guaranteed.
> >>
> >> Similarly, it might be possible to build a medoid sketch instead of a
> mean
> >> based sketch, but this is also unexplored ground.
> >>
> >> The virtue of the first approach (using a m-means sketch as input to
> >> k-medoids) would be that it would make the k-medoids scalable.
> >>
> >>
> >>
> >> On Mon, Jun 1, 2015 at 1:04 PM, Marko Dinic <marko.dinic@nissatech.com>
> >> wrote:
> >>
> >>  Hello everyone,
> >>>
> >>> I have an idea and I would like to get a validation from community
> about
> >>> it.
> >>>
> >>> In Mahout there is an implementation of Streaming K-means. I'm
> interested
> >>> in your opinion would it make sense to make a similar implementation of
> >>> Streaming K-medoids?
> >>>
> >>> K-medoids has even bigger problems than K-means because it's not
> >>> scalable,
> >>> but can be useful in some cases (e.g. It allows more sophisticated
> >>> distance
> >>> measures).
> >>>
> >>> What is your opinion about implementation of this?
> >>>
> >>> Best regards,
> >>> Marko
> >>>
> >>>
> >>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message