Also, with 500MB of data, this is likely to take only a few minutes on a
single machine with the new clustering code. It is hard to estimate
precisely, however, because dense and sparse data behave so differently.
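
To make the convergence delta point in the quoted message below concrete,
here is a minimal sketch in plain Python. It is not Mahout's code. The
function names, the toy data and the stopping rule (stop once no centroid
moves more than the delta) are my assumptions about the general idea.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(points, k, max_iter, convergence_delta, distance=euclidean, seed=0):
    """Toy Lloyd's k-means. Returns (centroids, iterations_used).

    Assumption: the run stops as soon as no centroid has moved farther than
    convergence_delta in one iteration, or after max_iter iterations.
    """
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for iteration in range(1, max_iter + 1):
        # Assignment step: put each point in the bucket of its nearest centroid.
        buckets = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: distance(p, centroids[i]))
            buckets[nearest].append(p)

        # Update step: move each centroid to the mean of its bucket.
        new_centroids = []
        for old, members in zip(centroids, buckets):
            if members:
                dims = range(len(old))
                mean = tuple(sum(m[d] for m in members) / len(members) for d in dims)
                new_centroids.append(mean)
            else:
                # Keep an empty cluster's centroid where it was.
                new_centroids.append(old)

        # Convergence check: largest centroid movement in this iteration.
        shift = max(distance(o, n) for o, n in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift <= convergence_delta:
            return centroids, iteration
    return centroids, max_iter


def make_points(n_per_blob=50, seed=42):
    """Hypothetical toy data: three Gaussian blobs in two dimensions."""
    rng = random.Random(seed)
    centers = [(0.0, 0.0), (5.0, 5.0), (0.0, 5.0)]
    return [(rng.gauss(cx, 1.0), rng.gauss(cy, 1.0))
            for cx, cy in centers for _ in range(n_per_blob)]


points = make_points()
for delta in (0.5, 0.001, 1e-9):
    _, used = kmeans(points, k=3, max_iter=200, convergence_delta=delta)
    print("delta", delta, "iterations used", used)
```

With the same starting centroids, a looser delta can never use more
iterations than a tighter one, and a delta that is too tight for the
distance measure runs all the way to max_iter. At the roughly 2 minutes per
iteration reported below, every iteration saved is about 2 minutes of wall
time.
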
On Wed, Sep 12, 2012 at 8:42 PM, Pat Ferrel <pat.ferrel@gmail.com> wrote:
> 200 iterations?
>
> What is your convergence delta? If it is too small for your distance
> measure you will perform all 200 iterations, every time you cluster.
>
> --convergenceDelta (-cd) convergenceDelta
>     The convergence delta value.
>     Default is 0.5
>
> I would set the convergence delta looser and see if 100 or even 20
> iterations produce good results. You can always tune your other
> parameters first and tighten the convergence delta later if needed. Also
> remember that a good convergence delta depends on your distance measure,
> so you need to think about which distance measure works for your data.
>
> I generally need only 10-20 iterations using cosine distance and 0.001 as
> the convergence delta, which would be 20-40 minutes for you.
>
> On Sep 12, 2012, at 7:26 PM, Elaine Gan <elainegan@gmo.jp> wrote:
>
> Hi,
>
> I'm trying to do some text analysis using mahout kmeans (clustering),
> processing the data on hadoop.
> numClusters = 160
> --maxIter (-x) maxIter = 200
>
> My data is small, around 500MB.
> I have 4 servers, each with 4 CPUs, and the TaskTrackers are set to a
> maximum of 4.
> When I run the mahout task, I can see that the number of map tasks is
> at most 3, so I guess I do not need to do any tuning on this at the
> moment.
>
> One iteration took around 1.5 to 2 minutes to finish.
> I am not sure whether this is normal or considered slow. Can anyone
> give me advice on this?
>
> And with x = 200, it took me around 6 hours (200 x roughly 2 mins)
> to finish the whole analysis.
> Is this unavoidable?
> Is it simply that the bigger "x" is, the longer the kmeans job takes?
>
> Are there any ways to speed up mahout kmeans?
>
> Thank you.
>
>
>
