mahout-user mailing list archives

From Jeff Eastman <>
Subject Re: Averting the (clustering) heapocalypse?
Date Wed, 26 Sep 2012 13:49:31 GMT
You did not mention the heap size configured on your cluster. As you 
work on this problem, consider:

  * In all of the algorithms, all clusters are retained in memory by the
    mappers and reducers
  * Each cluster holds 4 sparse vectors internally (center, radius, s1 & s2)
  * Vectors tend to become more dense as iterations progress due to
    summation of input vectors
  * FuzzyK is the worst offender since it assigns every point to every
    cluster, with a weight, in each iteration
  * Adjust T1=T2 until you get a reasonable number of clusters using Canopy
  * Text problems usually generate very wide, sparse vectors, but the
    cluster vectors grow denser with each iteration for the reason above
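
A rough way to see how these points combine is to estimate the heap that
the cluster state alone needs in a single mapper or reducer. The sketch
below is only a back-of-the-envelope model, not Mahout code: the function
name, the 12 bytes per non-zero entry (a 4-byte index plus an 8-byte
double, ignoring JVM object overhead) and the 500,000-term dictionary are
all assumptions made for illustration.

```python
def estimate_cluster_heap_bytes(num_clusters, dimensions, density,
                                vectors_per_cluster=4, bytes_per_entry=12):
    """Approximate bytes held by cluster vectors in one task.

    density is the fraction of non-zero entries per vector (0.0 to 1.0).
    bytes_per_entry is an assumed cost per non-zero entry; real JVM
    overhead is higher, so treat the result as optimistic.
    """
    nonzeros_per_vector = round(dimensions * density)
    return (num_clusters * vectors_per_cluster
            * nonzeros_per_vector * bytes_per_entry)


if __name__ == "__main__":
    # Hypothetical text problem: 100 clusters over a 500,000-term dictionary.
    for density in (0.01, 0.1, 1.0):
        size = estimate_cluster_heap_bytes(100, 500_000, density)
        print(f"density {density:>4}: {size / 1e6:,.0f} MB")
```

Under these assumptions the same 100 clusters go from roughly 24 MB at 1%
density to roughly 2.4 GB once the vectors are fully dense, which is why a
heap that survives the first iteration can still fail in a later one.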

On 9/26/12 2:57 AM, paritosh ranjan wrote:
> On Wed, Sep 26, 2012 at 4:07 AM, Adair Kovac <> wrote:
>> Hi folks, I'm running Mahout 0.7 and using the clustering commandline
>> tools. Problem is, the only one I can get to supply useful information on
>> my data set and small (3-node) cluster is kmeans, so far.
>> canopy either groups everything that isn't a starter-point into one cluster
>> or gets GC out of memory errors. The "either" is based on my fiddling with t
>> values and MAHOUT_HEAPSIZE.
> The values of t1 and t2 can also play a role here. You can adjust t2 upward,
> which will reduce the number of canopies produced and might help in
> getting rid of the memory issues.
>> fkmeans throws Java heap space errors, even after I reduced my vectors set
>> to a whopping 24.0 MB (trying for 100 clusters).
> The Fuzziness constraint might be too fuzzy. You can try with a stricter
> one and loosen it step by step to find the breaking point.
>> clusterdump similarly curls up and dies (heap space errors) when I try to
>> get it to dump all (or much more than 500 per cluster) of the clustered
>> points at the end of my kmeans algorithm.
> Try the clusterpp command; it does not have these memory problems.
>> kmeans took over 10 hours to run on 228.3MB of vectors, hitting the max
>> iterations of 10. (Right now I'm running it on a 969.0MB vector file,
>> hopefully it'll finish successfully.)
> The cluster currently has only 3 nodes, if I understood correctly; maybe
> you can add more nodes to make it faster.
> KMeans is by nature a multiple-iteration algorithm. One thing that can
> be done is to find canopies first and then run fewer KMeans iterations,
> since the quality will be good if the initial clusters are sensible. This
> can significantly reduce the total run time.
>> I'm using small text documents, so the number + sparsity might be the
>> problem.
> Yes, might be.
>> Are these issues unusual? Any advice on resolving them? Most of the google
>> hits for similar issues just suggest setting MAHOUT_HEAPSIZE to 2048.
> Some tuning always helps to run it properly.
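
The T1/T2 advice in this thread can be illustrated with a toy canopy
generator. This is a simplified, in-memory sketch of the general canopy
idea, not Mahout's implementation; the function name, the 1-D points and
the distance function are assumptions chosen only to show how raising T2
reduces the number of canopies.

```python
def canopy_centers(points, t1, t2, distance):
    """Return (center, members) pairs for a simple canopy pass.

    Points within t1 of a center become members of its canopy.
    Points within t2 of a center can no longer start a canopy of their own.
    """
    if t2 > t1:
        raise ValueError("t2 must not exceed t1")
    remaining = list(points)
    canopies = []
    while remaining:
        center = remaining.pop(0)
        members = [p for p in points if distance(center, p) <= t1]
        canopies.append((center, members))
        # Drop every candidate that is tightly bound to this center.
        remaining = [p for p in remaining if distance(center, p) > t2]
    return canopies


def abs_distance(a, b):
    """Toy 1-D distance used for the illustration."""
    return abs(a - b)


if __name__ == "__main__":
    points = [float(i) for i in range(10)]  # hypothetical data: 0.0 .. 9.0
    for t in (0.5, 2.5):
        # Following the suggestion above, T1 and T2 are kept equal.
        canopies = canopy_centers(points, t, t, abs_distance)
        print(f"T1=T2={t}: {len(canopies)} canopies")
```

On this toy data, a threshold of 0.5 turns every point into its own canopy,
while 2.5 leaves four. Fewer canopies means fewer cluster vectors held in
memory, at the cost of coarser initial clusters.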
