mahout-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: What are the best settings for my clustering task
Date Wed, 02 Oct 2013 16:59:11 GMT
The way that the new streaming k-means works is that there is a first
sketch pass which only requires an upper bound on the final number of
clusters you will want.  It adaptively creates more or less clusters
depending on the data and your bound.  This sketch is guaranteed to be
computed within at most one map-reduce pass.  There is a threaded version
that runs (fast) on a single machine.  The threaded version is liable to be
faster than the map-reduce version for moderate or smaller data sizes.

That sketch can then be used to do all kinds of things that rely on
Euclidean distance and still get results within a small factor of the same
algorithm applied to all of the data.  Typically this second phase is a
ball k-means algorithm, but it could easily be a dp-means algorithm [1] if
you want a variable number of clusters.  Indeed, you could run many
dp-means passes with different values of lambda on the same sketch.  Note
that the sketch is small enough that in-memory clustering is entirely
viable and is very fast.

For the problem you describe, however, you probably don't need the sketch
approach at all and can probably apply ball k-means or dp-means directly.
 Running many k-means clusterings with differing values of k should be
entirely feasible as well with such data sizes.

[1] http://www.cs.berkeley.edu/~jordan/papers/kulis-jordan-icml12.pdf




On Wed, Oct 2, 2013 at 9:11 AM, Jens Bonerz <jbonerz@googlemail.com> wrote:

> Isn't the streaming k-means just a different approach to crunch through the
> data? In other words, the result of streaming k-means should be comparable
> to using k-means in multiple chained map reduce cycles?
>
> I just read a paper about the k-means clustering and its underlying
> algorithm.
>
> According to that paper, k-means relies on a preknown/predefined amount of
> clusters as an input parameter.
>
> Link: http://books.nips.cc/papers/files/nips22/NIPS2009_1085.pdf
>
> In my current scenario however, the number of clusters is unknown at the
> beginning.
>
> Maybe k-means is just not the right algorithm for clustering similar
> products based on their short description text? What else could I use?
>
>
>
>
> 2013/10/1 Ted Dunning <ted.dunning@gmail.com>
>
> > At such small sizes, I would guess that the sequential version of the
> > streaming k-means or ball k-means would be better options.
> >
> >
> >
> > On Mon, Sep 30, 2013 at 2:14 PM, mercutio7979 <jbonerz@googlemail.com
> > >wrote:
> >
> > > Hello all,
> > >
> > > I am currently trying create clusters from a group of 50.000 strings
> that
> > > contain product descriptions (around 70-100 characters length each).
> > >
> > > That group of 50.000 consists of roughly 5.000 individual products and
> > ten
> > > varying product descriptions per product. The product descriptions are
> > > already prepared for clustering and contain a normalized brand name,
> > > product
> > > model number, etc.
> > >
> > > What would be a good approach to maximise the amound of found clusters
> > (the
> > > best possible value would be 5.000 clusters with 10 products each)
> > >
> > > I adapted the reuters cluster script to read in my data and managed to
> > > create a first set of clusters. However, I have not managed to maximise
> > the
> > > cluster count.
> > >
> > > The question is: what do I need to tweak with regard to the available
> > > mahout
> > > settings, so the clusters are created as precisely as possible?
> > >
> > > Many regards!
> > > Jens
> > >
> > >
> > >
> > >
> > >
> > > --
> > > View this message in context:
> > >
> >
> http://lucene.472066.n3.nabble.com/What-are-the-best-settings-for-my-clustering-task-tp4092807.html
> > > Sent from the Mahout User List mailing list archive at Nabble.com.
> > >
> >
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message