mahout-user mailing list archives

From Rahul Mishra <mishra.rah...@gmail.com>
Subject Re: Averting the (clustering) heapocalypse?
Date Thu, 27 Sep 2012 10:28:34 GMT
What is the significance of s0, s1 and s2? Apologies if this is a dumb
question, but I cannot find any comments about them in the code.
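
For anyone landing here with the same question, a minimal sketch under one assumption: that s0, s1 and s2 are the running weighted moments of the points assigned to a cluster (total weight, weighted sum, weighted sum of squares), from which a center and a per-dimension radius can be derived. The thread does not confirm this reading, and the RunningMoments class and its method names are hypothetical, not Mahout's API.

```python
import math


class RunningMoments:
    """Hypothetical accumulator for illustration; not a Mahout class."""

    def __init__(self, dim):
        self.s0 = 0.0            # assumed: sum of weights
        self.s1 = [0.0] * dim    # assumed: weighted sum of points
        self.s2 = [0.0] * dim    # assumed: weighted sum of squared points

    def observe(self, x, weight=1.0):
        self.s0 += weight
        for i, xi in enumerate(x):
            self.s1[i] += weight * xi
            self.s2[i] += weight * xi * xi

    def center(self):
        # Mean per dimension: s1 / s0
        return [a / self.s0 for a in self.s1]

    def radius(self):
        # Standard deviation per dimension: sqrt(s0*s2 - s1^2) / s0
        return [
            math.sqrt(max(self.s0 * b - a * a, 0.0)) / self.s0
            for a, b in zip(self.s1, self.s2)
        ]


m = RunningMoments(2)
for p in [(1.0, 2.0), (3.0, 2.0)]:
    m.observe(p)
print(m.center(), m.radius())
```

Under that assumption, only the three sums need to be kept while points stream past; center and radius are recomputed from them at the end of each iteration.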

On Wed, Sep 26, 2012 at 7:19 PM, Jeff Eastman <jdog@windwardsolutions.com> wrote:

>  You did not mention the heap size configured on your cluster. As you
> work on this problem, consider:
>
>    - In all of the algorithms, all clusters are retained in memory by the
>    mappers and reducers
>    - Each cluster holds 4 sparse vectors internally (center, radius, s1 &
>    s2)
>    - Vectors tend to become more dense as iterations progress due to
>    summation of input vectors
>    - FuzzyK is the worst offender since it assigns every point to every
>    cluster, with a weight, in each iteration
>    - Adjust T1=T2 until you get a reasonable number of clusters using
>    Canopy
>    - Text problems usually generate very wide, sparse vectors, but the
>    clusters grow in size with iterations due to the summation noted above
>
>
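
The point above about vectors becoming denser can be seen in a small sketch. It assumes a sparse vector is stored as an {index: value} dict, which is an illustrative stand-in and not Mahout's vector classes; add_sparse is a hypothetical helper.

```python
def add_sparse(a, b):
    """Add two sparse vectors stored as {index: value} dicts."""
    out = dict(a)
    for i, v in b.items():
        out[i] = out.get(i, 0.0) + v
    return out


# Three short "documents", each with only two non-zero terms.
docs = [
    {0: 1.0, 7: 2.0},
    {3: 1.0, 7: 1.0},
    {5: 4.0, 9: 1.0},
]

total = {}
sizes = []
for d in docs:
    total = add_sparse(total, d)
    sizes.append(len(total))  # non-zero entries held by the running sum

print(sizes)
```

Each input has two non-zero entries, but the running sum holds the union of all indices seen so far, so per-cluster sums over many text documents can end up far denser than any single input vector.
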
> On 9/26/12 2:57 AM, paritosh ranjan wrote:
>
> On Wed, Sep 26, 2012 at 4:07 AM, Adair Kovac <adairkovac@gmail.com> wrote:
>
>
>  Hi folks, I'm running Mahout 0.7 and using the clustering commandline
> tools. Problem is, the only one I can get to supply useful information on
> my data set and small (3-node) cluster is kmeans, so far.
>
>
>  canopy either groups everything that isn't a starter-point into one cluster
> or gets GC out of memory errors. The "either" is based on my fiddling with t
> values and MAHOUT_HEAPSIZE.
>
>
>  Values of t1 and t2 can also play a role here. You can adjust t2 upward
> and that will reduce the number of canopies produced, which might help in
> getting rid of memory issues.
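
A minimal sketch of why raising t2 reduces the canopy count, assuming the textbook single-pass canopy procedure: points within t1 of a center join its canopy, and points within t2 are removed from the candidate list. The function names are hypothetical, and this is not Mahout's MapReduce implementation.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def canopies(points, t1, t2):
    """Textbook canopy pass; returns a list of (center, members) pairs."""
    if t1 < t2:
        raise ValueError("expected t1 >= t2")
    remaining = list(points)
    result = []
    while remaining:
        center = remaining.pop(0)
        members = [center]
        still_candidates = []
        for p in remaining:
            d = euclidean(center, p)
            if d < t1:
                members.append(p)           # loosely belongs to this canopy
            if d >= t2:
                still_candidates.append(p)  # may still seed a later canopy
        remaining = still_candidates
        result.append((center, members))
    return result


points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (10.0, 0.0)]
tight = canopies(points, t1=4.0, t2=0.5)
loose = canopies(points, t1=4.0, t2=3.5)
print(len(tight), len(loose))
```

With the small t2 every point survives as a future center, so each one seeds its own canopy; with the larger t2 the four nearby points collapse into one canopy and only the outlier seeds a second.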
>
>
>
>  fkmeans throws Java heap space errors, even after I reduced my vectors set
> to a whopping 24.0 MB (trying for 100 clusters).
>
>
>  The Fuzziness constraint might be too fuzzy. You can try with a stricter
> one and loosen it step by step to find the breaking point.
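
A minimal sketch of the effect, assuming the fuzziness constraint is the exponent m > 1 of the standard fuzzy c-means membership formula, u_j = 1 / sum_k (d_j / d_k)^(2 / (m - 1)). The memberships function is hypothetical and is not taken from Mahout's code.

```python
def memberships(distances, m):
    """Fuzzy c-means memberships of one point, given its distance to each center."""
    if m <= 1.0:
        raise ValueError("fuzziness m must be greater than 1")
    zero_hits = [d == 0.0 for d in distances]
    if any(zero_hits):
        # The point sits exactly on one or more centers: split weight among them.
        n = sum(zero_hits)
        return [1.0 / n if hit else 0.0 for hit in zero_hits]
    exponent = 2.0 / (m - 1.0)
    return [
        1.0 / sum((dj / dk) ** exponent for dk in distances)
        for dj in distances
    ]


d = [1.0, 2.0, 4.0]               # distances from one point to three centers
soft = memberships(d, m=2.0)      # fuzzier
strict = memberships(d, m=1.1)    # closer to hard assignment
print(soft, strict)
```

Even with m close to 1 every cluster receives a non-zero weight from every point, which is consistent with the remark earlier in the thread that FuzzyK assigns every point to every cluster; a stricter m only concentrates the weight on the nearest center.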
>
>
>
>  clusterdump similarly curls up and dies (heap space errors) when I try to
> get it to dump all (or much more than 500 per cluster) of the clustered
> points at the end of my kmeans algorithm.
>
>
>  Try the clusterpp command; it does not have any memory problems.
> https://cwiki.apache.org/confluence/display/MAHOUT/Top+Down+Clustering
>
>  kmeans took over 10 hours to run on 228.3MB of vectors, hitting the max
> iterations of 10. (Right now I'm running it on a 969.0MB vector file,
> hopefully it'll finish successfully.)
>
>
>
>  The cluster currently has only 3 nodes, if I understood correctly; adding
> more nodes may make it faster.
> KMeans is by nature a multiple-iteration algorithm. One thing that can be
> done is to find canopies first and then run fewer KMeans iterations, since
> the quality will be good if the initial clusters are proper. This can
> significantly reduce the total run time.
>
>
>
>
>  I'm using small text documents, so the number + sparsity might be the
> problem.
>
>
>
>  Yes, might be.
>
>
>
>  Are these issues unusual? Any advice on resolving them? Most of the google
> hits for similar issues just suggest setting MAHOUT_HEAPSIZE to 2048.
>
>
>  Some tuning always helps to run it properly.
>
>
>
>


-- 
Regards,
Rahul K Mishra,
www.ee.iitb.ac.in/student/~rahulkmishra<http://www.ee.iitb.ac.in/student/%7Erahulkmishra>
