mahout-user mailing list archives

From Pat Ferrel <pat.fer...@gmail.com>
Subject Re: Is mahout kmeans slow ?
Date Thu, 13 Sep 2012 15:26:13 GMT
Actually, if it is really taking 200 iterations then it is never meeting your convergence delta.
That means either your data does not cluster well or your convergence delta is still too tight.

I was suggesting that you loosen the convergence delta until it only takes 10-20 iterations
to cluster, then look at the data, tune your other parameters, scrub your input, etc. before
tightening your delta. If it takes 6 hours to cluster, then tuning your other params will take
too long, so tune them first while each run is still short.
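
To illustrate why a delta that is never met costs the full maxIter, here is a minimal
single-machine sketch of a k-means loop in Python. It is not Mahout's implementation:
the function names, the toy data, and the exact stopping rule (stop once no centroid
moves more than the delta under the chosen distance measure) are assumptions made for
illustration only.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; defined as 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def kmeans(points, initial_centroids, distance, convergence_delta, max_iter):
    """Toy k-means. Returns (centroids, iterations_run, converged)."""
    centroids = [list(c) for c in initial_centroids]
    for iteration in range(1, max_iter + 1):
        # Assignment step: each point joins its nearest centroid.
        groups = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: distance(p, centroids[i]))
            groups[nearest].append(p)

        # Update step: each centroid becomes the mean of its members.
        new_centroids = []
        for old, members in zip(centroids, groups):
            if members:
                new_centroids.append(
                    [sum(m[d] for m in members) / len(members)
                     for d in range(len(old))])
            else:
                new_centroids.append(old)  # keep an empty cluster where it was

        # Convergence test (assumed rule): largest centroid movement vs. delta.
        shift = max(distance(o, n) for o, n in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift <= convergence_delta:
            return centroids, iteration, True

    # The delta was never met, so every one of max_iter iterations was paid for.
    return centroids, max_iter, False


if __name__ == "__main__":
    points = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
    start = [[1.0, 0.0], [0.0, 1.0]]
    for delta in (0.1, 0.001, -1.0):  # -1.0 can never be satisfied
        _, n, ok = kmeans(points, start, cosine_distance, delta, 200)
        print(f"delta={delta}: iterations={n}, converged={ok}")
```

In this sketch the total run time is simply iterations run times the cost of one
iteration, so a delta that the data can never satisfy always costs the full maxIter.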

On Sep 13, 2012, at 7:59 AM, Pat Ferrel <pat.ferrel@gmail.com> wrote:

What distance measure?
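
The reason the distance measure matters: the same convergence delta means different
things under different measures. A small Python sketch, using made-up vectors and
hypothetical helper names rather than anything from Mahout, shows one centroid move
that a delta of 0.1 would accept under cosine distance but reject under Euclidean
distance.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance; unbounded, grows with vector magnitude."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """1 - cosine similarity; depends only on direction, not on length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


# Illustrative centroid before and after one iteration:
# same direction, 10% longer.
old_centroid = [3.0, 4.0, 0.0]
new_centroid = [3.3, 4.4, 0.0]

delta = 0.1
for name, measure in (("euclidean", euclidean_distance),
                      ("cosine", cosine_distance)):
    moved = measure(old_centroid, new_centroid)
    print(f"{name}: moved {moved:.6f}, within delta {delta}: {moved <= delta}")
```

So a -cd of 0.1 may be loose for one measure and tight for another, which is why the
measure has to be known before judging the delta.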

On Sep 12, 2012, at 10:37 PM, Elaine Gan <elaine-gan@gmo.jp> wrote:

My -cd was quite loose, set it at 0.1

Hmm.. maybe the data is too small, causing the low performance..?


> 200 iterations?
> 
> What is your convergence delta? If it is too small for your distance measure you will
> perform all 200 iterations, every time you cluster.
> 
> --convergenceDelta (-cd) convergenceDelta                  
>         The convergence delta value.       
>          Default is 0.5  
> 
> I would set the convergence delta looser and see if 100 or even 20 iterations produces
> good results. You can always tweak your other parameters to get them tuned and up your convergence
> if needed. Also remember that a good convergence is related to your distance measure so you
> need to think about which distance measure works for your data.
> 
> I generally only take 10-20 iterations using cosine distance and 0.001 as the convergence
> delta, which would be 20-40 minutes for you.
> 
> On Sep 12, 2012, at 7:26 PM, Elaine Gan <elaine-gan@gmo.jp> wrote:
> 
> Hi,
> 
> I'm trying to do some text analysis using mahout kmeans (clustering),
> processing the data on hadoop.
> --numClusters = 160 
> --maxIter (-x) maxIter = 200
> 
> Well, my data is small, around 500MB.
> I have 4 servers, each with 4 CPUs, and TaskTrackers are set to 4 as
> maximum.
> When I run the Mahout job, I can see that there are at most 3 map
> tasks, so I guess I do not need to do any tuning on this at the
> moment.
> 
> One iteration took around 1.5 to 2 minutes to finish.
> I am not sure whether this is normal or considered slow. Can anyone
> give me advice on this?
> 
> And with x = 200, it took me around 200 x 2 mins = 6 hours
> to finish the whole analysis.
> Is this unavoidable?
> Does a bigger "x" always mean the kmeans job takes longer to finish?
> 
> Are there any ways to speed up Mahout kmeans?
> 
> Thank you.
> 



