mahout-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jens Bonerz <jbon...@googlemail.com>
Subject Re: What are the best settings for my clustering task
Date Wed, 02 Oct 2013 17:21:32 GMT
thx for your elaborate answer.

so if the upper bound on the final number of clusters is unknown in the
beginning, what would happen, if I define a very high number that is
guaranteed to be > the estimated number of clusters.
for example if I set it to 10.000 clusters if an estimate of 5.000 is
likely, will that work?


2013/10/2 Ted Dunning <ted.dunning@gmail.com>

> The way that the new streaming k-means works is that there is a first
> sketch pass which only requires an upper bound on the final number of
> clusters you will want.  It adaptively creates more or less clusters
> depending on the data and your bound.  This sketch is guaranteed to be
> computed within at most one map-reduce pass.  There is a threaded version
> that runs (fast) on a single machine.  The threaded version is liable to be
> faster than the map-reduce version for moderate or smaller data sizes.
>
> That sketch can then be used to do all kinds of things that rely on
> Euclidean distance and still get results within a small factor of the same
> algorithm applied to all of the data.  Typically this second phase is a
> ball k-means algorithm, but it could easily be a dp-means algorithm [1] if
> you want a variable number of clusters.  Indeed, you could run many
> dp-means passes with different values of lambda on the same sketch.  Note
> that the sketch is small enough that in-memory clustering is entirely
> viable and is very fast.
>
> For the problem you describe, however, you probably don't need the sketch
> approach at all and can probably apply ball k-means or dp-means directly.
>  Running many k-means clusterings with differing values of k should be
> entirely feasible as well with such data sizes.
>
> [1] http://www.cs.berkeley.edu/~jordan/papers/kulis-jordan-icml12.pdf
>
>
>
>
> On Wed, Oct 2, 2013 at 9:11 AM, Jens Bonerz <jbonerz@googlemail.com>
> wrote:
>
> > Isn't the streaming k-means just a different approach to crunch through
> the
> > data? In other words, the result of streaming k-means should be
> comparable
> > to using k-means in multiple chained map reduce cycles?
> >
> > I just read a paper about the k-means clustering and its underlying
> > algorithm.
> >
> > According to that paper, k-means relies on a preknown/predefined amount
> of
> > clusters as an input parameter.
> >
> > Link: http://books.nips.cc/papers/files/nips22/NIPS2009_1085.pdf
> >
> > In my current scenario however, the number of clusters is unknown at the
> > beginning.
> >
> > Maybe k-means is just not the right algorithm for clustering similar
> > products based on their short description text? What else could I use?
> >
> >
> >
> >
> > 2013/10/1 Ted Dunning <ted.dunning@gmail.com>
> >
> > > At such small sizes, I would guess that the sequential version of the
> > > streaming k-means or ball k-means would be better options.
> > >
> > >
> > >
> > > On Mon, Sep 30, 2013 at 2:14 PM, mercutio7979 <jbonerz@googlemail.com
> > > >wrote:
> > >
> > > > Hello all,
> > > >
> > > > I am currently trying create clusters from a group of 50.000 strings
> > that
> > > > contain product descriptions (around 70-100 characters length each).
> > > >
> > > > That group of 50.000 consists of roughly 5.000 individual products
> and
> > > ten
> > > > varying product descriptions per product. The product descriptions
> are
> > > > already prepared for clustering and contain a normalized brand name,
> > > > product
> > > > model number, etc.
> > > >
> > > > What would be a good approach to maximise the amound of found
> clusters
> > > (the
> > > > best possible value would be 5.000 clusters with 10 products each)
> > > >
> > > > I adapted the reuters cluster script to read in my data and managed
> to
> > > > create a first set of clusters. However, I have not managed to
> maximise
> > > the
> > > > cluster count.
> > > >
> > > > The question is: what do I need to tweak with regard to the available
> > > > mahout
> > > > settings, so the clusters are created as precisely as possible?
> > > >
> > > > Many regards!
> > > > Jens
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > --
> > > > View this message in context:
> > > >
> > >
> >
> http://lucene.472066.n3.nabble.com/What-are-the-best-settings-for-my-clustering-task-tp4092807.html
> > > > Sent from the Mahout User List mailing list archive at Nabble.com.
> > > >
> > >
> >
>



-- 
CEO
Hightech Marketing Group
Cell/Mobile: +49 173 539 3588

____

Hightech Marketing Group
Frankenstraße 32
50354 Huerth
Germany
Phone: +49 (0)2233 – 619 2741
Fax: +49 (0)2233 – 619 27419
Web: www.hightechmg.com

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message