mahout-user mailing list archives

From paritosh ranjan <paritoshranj...@gmail.com>
Subject Re: Averting the (clustering) heapocalypse?
Date Wed, 26 Sep 2012 06:57:47 GMT
On Wed, Sep 26, 2012 at 4:07 AM, Adair Kovac <adairkovac@gmail.com> wrote:

> Hi folks, I'm running Mahout 0.7 and using the clustering commandline
> tools. Problem is, the only one I can get to supply useful information on
> my data set and small (3-node) cluster is kmeans, so far.
>

> canopy either groups everything that isn't a starter-point into one cluster
> or gets GC out of memory errors. The "either" is based on my fiddling with t
> values and MAHOUT_HEAPSIZE.
>

The values of t1 and t2 can also play a role here. Adjusting t2 upward
reduces the number of canopies produced, which might help with the memory
issues.
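
To illustrate why a larger t2 tends to produce fewer canopies, here is a minimal sketch of the textbook canopy procedure in Python. It is not Mahout's implementation; the function name `build_canopies` and the sample points are made up for illustration, and the sketch assumes t1 > t2 and Euclidean distance.

```python
import math


def build_canopies(points, t1, t2):
    """Textbook canopy clustering sketch (illustrative, not Mahout's code).

    t1 is the loose threshold: points within t1 of a center join its canopy.
    t2 is the tight threshold: points within t2 of a center are removed from
    the candidate list, so they can never start a canopy of their own.
    """
    if t1 <= t2:
        raise ValueError("t1 must be greater than t2")
    remaining = list(points)
    canopies = []
    while remaining:
        center = remaining.pop(0)          # next unclaimed point starts a canopy
        members = [center]
        still_remaining = []
        for p in remaining:
            d = math.dist(center, p)
            if d < t1:
                members.append(p)          # loosely belongs to this canopy
            if d >= t2:
                still_remaining.append(p)  # may still seed a later canopy
        remaining = still_remaining
        canopies.append((center, members))
    return canopies


# Hypothetical sample data: three loose groups on a line.
POINTS = [(0.0, 0.0), (0.5, 0.0), (1.0, 0.0), (5.0, 0.0), (5.5, 0.0), (10.0, 0.0)]

if __name__ == "__main__":
    for t1, t2 in [(1.2, 0.6), (3.0, 2.0), (8.0, 6.0)]:
        print(t1, t2, len(build_canopies(POINTS, t1, t2)))
```

On this sample the canopy count drops as t2 grows, because each center removes more candidate points. On real data the trend is the same, though it is not guaranteed to be strictly monotonic for every ordering of the points.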


> fkmeans throws Java heap space errors, even after I reduced my vectors set
> to a whopping 24.0 MB (trying for 100 clusters).
>

The fuzziness setting might be too fuzzy. Try a stricter value first and
then loosen it step by step to find the breaking point.
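
As a rough picture of what "stricter" and "looser" mean, the sketch below computes memberships with the standard fuzzy c-means formula for a single point, given its distances to each cluster center. This is the textbook formula, not code taken from Mahout, and the function name `memberships` and the sample distances are assumptions for illustration. A fuzziness value m close to 1 gives nearly hard assignments; larger m spreads the membership across clusters.

```python
def memberships(distances, m):
    """Fuzzy c-means membership of one point in each cluster.

    distances: distance from the point to each cluster center.
    m: fuzziness exponent, must be greater than 1.
    Uses u_i = 1 / sum_k (d_i / d_k) ** (2 / (m - 1)).
    """
    if m <= 1:
        raise ValueError("m must be greater than 1")
    # A point sitting exactly on a center belongs fully to that center.
    if any(d == 0 for d in distances):
        zeros = [1.0 if d == 0 else 0.0 for d in distances]
        total = sum(zeros)
        return [z / total for z in zeros]
    exponent = 2.0 / (m - 1.0)
    return [
        1.0 / sum((d_i / d_k) ** exponent for d_k in distances)
        for d_i in distances
    ]


# Hypothetical distances from one document vector to three cluster centers.
SAMPLE_DISTANCES = [1.0, 2.0, 4.0]

if __name__ == "__main__":
    for m in (1.1, 2.0, 3.0):
        print(m, [round(u, 4) for u in memberships(SAMPLE_DISTANCES, m)])
```

Starting with m near 1 and raising it gradually follows the suggestion above: each step makes every point contribute to more clusters.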


> clusterdump similarly curls up and dies (heap space errors) when I try to
> get it to dump all (or much more than 500 per cluster) of the clustered
> points at the end of my kmeans algorithm.
>

Try the clusterpp command; it does not have these memory problems.
https://cwiki.apache.org/confluence/display/MAHOUT/Top+Down+Clustering


> kmeans took over 10 hours to run on 228.3MB of vectors, hitting the max
> iterations of 10. (Right now I'm running it on a 969.0MB vector file,
> hopefully it'll finish successfully.)
>
>
If I understood correctly, the cluster currently has only 3 nodes; adding
more nodes may make it faster.
KMeans is by nature a multi-iteration algorithm. One thing that can be done
is to find canopies first and then run fewer KMeans iterations. If the
initial clusters are good, the quality will be good as well, and this can
significantly reduce the total run time.
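
The effect of good initial clusters can be seen in a toy example. The sketch below is plain Lloyd-style k-means in Python, not Mahout's MapReduce implementation; the function name `kmeans`, the sample points, and both seed lists are hypothetical. The "good" seeds stand in for centers that a canopy pass might produce.

```python
import math


def kmeans(points, initial_centers, max_iterations=10, tol=1e-9):
    """Plain k-means; returns (centers, number of iterations actually run)."""
    centers = [tuple(c) for c in initial_centers]
    for iteration in range(1, max_iterations + 1):
        # Assignment step: each point goes to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its members.
        new_centers = [
            tuple(sum(axis) / len(members) for axis in zip(*members))
            if members else old            # keep an empty cluster's center
            for members, old in zip(clusters, centers)
        ]
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:                    # converged: centers stopped moving
            return centers, iteration
    return centers, max_iterations


# Hypothetical 1-D data with two obvious groups.
DATA = [(0.0,), (0.5,), (1.0,), (9.0,), (9.5,), (10.0,)]
GOOD_SEEDS = [(0.0,), (9.0,)]   # one seed per group, as a canopy pass might give
BAD_SEEDS = [(0.0,), (0.5,)]    # both seeds in the same group

if __name__ == "__main__":
    print(kmeans(DATA, GOOD_SEEDS))
    print(kmeans(DATA, BAD_SEEDS))
```

Both runs end at the same centers, but the well-placed seeds need fewer passes over the data. On a Hadoop cluster each pass is a full job, so saving iterations matters more than this small example suggests.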



> I'm using small text documents, so the number + sparsity might be the
> problem.
>
>
Yes, might be.


> Are these issues unusual? Any advice on resolving them? Most of the google
> hits for similar issues just suggest setting MAHOUT_HEAPSIZE to 2048.
>

Some tuning always helps to run it properly.
