mahout-user mailing list archives

From Jens Bonerz <jbon...@googlemail.com>
Subject Re: What are the best settings for my clustering task
Date Wed, 02 Oct 2013 16:11:46 GMT
Isn't streaming k-means just a different way of crunching through the
data? In other words, shouldn't its result be comparable to running
k-means over multiple chained MapReduce cycles?

I just read a paper about the k-means clustering and its underlying
algorithm.

According to that paper, k-means relies on a predefined number of
clusters as an input parameter.

Link: http://books.nips.cc/papers/files/nips22/NIPS2009_1085.pdf
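
To make the point about the input parameter concrete, here is a minimal
sketch of plain (Lloyd's) k-means in Python. It is not Mahout's
implementation and not the algorithm from the paper; the function name
`kmeans`, its parameters, and the toy points are assumptions made up for
illustration. The thing to notice is that `k` has to be supplied before
anything else can happen.

```python
import random

def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd's k-means over a list of equal-length numeric tuples.

    Hypothetical helper for illustration only; k must be chosen by the caller.
    """
    rng = random.Random(seed)
    # The initial centroids are k distinct input points, so k is needed up front.
    centroids = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid
        # (squared Euclidean distance).
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])),
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members;
        # a centroid with no members stays where it is.
        for i, members in enumerate(clusters):
            if members:
                centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))
    # clusters holds the assignment made in the last iteration.
    return centroids, clusters

# Toy data (made up): two visually separate groups of 2-D points.
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(pts, k=2)
```

Calling it with a different `k` on the same data always returns exactly
that many centroids, whether or not the data really has that many groups,
which is why an unknown cluster count is a problem for this family of
algorithms.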

In my current scenario however, the number of clusters is unknown at the
beginning.

Maybe k-means is just not the right algorithm for clustering similar
products based on their short description text? What else could I use?




2013/10/1 Ted Dunning <ted.dunning@gmail.com>

> At such small sizes, I would guess that the sequential version of the
> streaming k-means or ball k-means would be better options.
>
>
>
> On Mon, Sep 30, 2013 at 2:14 PM, mercutio7979 <jbonerz@googlemail.com> wrote:
>
> > Hello all,
> >
> > I am currently trying to create clusters from a group of 50.000 strings that
> > contain product descriptions (around 70-100 characters each).
> >
> > That group of 50.000 consists of roughly 5.000 individual products and ten
> > varying product descriptions per product. The product descriptions are
> > already prepared for clustering and contain a normalized brand name,
> > product model number, etc.
> >
> > What would be a good approach to maximise the number of clusters found (the
> > best possible value would be 5.000 clusters with 10 products each)?
> >
> > I adapted the reuters cluster script to read in my data and managed to
> > create a first set of clusters. However, I have not managed to maximise the
> > cluster count.
> >
> > The question is: what do I need to tweak with regard to the available
> > Mahout settings, so the clusters are created as precisely as possible?
> >
> > Many regards!
> > Jens
> >
> >
> >
> >
> >
> > --
> > View this message in context:
> > http://lucene.472066.n3.nabble.com/What-are-the-best-settings-for-my-clustering-task-tp4092807.html
> > Sent from the Mahout User List mailing list archive at Nabble.com.
> >
>
