What exactly is the significance of s0, s1 and s2? Apologies if this is a
dumb question, but I could not find any comments in the code.
On Wed, Sep 26, 2012 at 7:19 PM, Jeff Eastman <jdog@windwardsolutions.com> wrote:
> You did not mention the heap size configured on your cluster. As you
> work on this problem, consider:
>
>  - In all of the algorithms, all clusters are retained in memory by the
>    mappers and reducers
>  - Each cluster holds 4 sparse vectors internally (center, radius, s1 &
>    s2)
>  - Vectors tend to become more dense as iterations progress due to
>    summation of input vectors
>  - FuzzyK is the worst offender since it assigns every point to every
>    cluster with weight in each iteration
>  - Adjust T1=T2 until you get a reasonable number of clusters using
>    Canopy
>  - Text problems usually generate very wide, sparse vectors but the
>    clusters grow in size with iterations due to above
>
>
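For illustration, here is a minimal sketch of the kind of running sums that could sit behind a cluster's center and radius. It assumes s0 is the total weight of the observed points, s1 the weighted sum of the points, and s2 the weighted sum of their squares; that reading is an assumption, not something confirmed in this thread. RunningCluster and its methods are hypothetical names, not Mahout's API, and sparse vectors are modelled as plain dicts.

```python
import math
from collections import defaultdict


class RunningCluster:
    """Hypothetical accumulator (not Mahout's class) for one cluster."""

    def __init__(self):
        self.s0 = 0.0                   # assumed: total weight of observed points
        self.s1 = defaultdict(float)    # assumed: weighted sum of points (sparse)
        self.s2 = defaultdict(float)    # assumed: weighted sum of squares (sparse)

    def observe(self, x, weight=1.0):
        """Fold one sparse point {index: value} into the running sums."""
        self.s0 += weight
        for i, v in x.items():
            self.s1[i] += weight * v
            self.s2[i] += weight * v * v

    def center(self):
        """Mean per dimension: s1 / s0."""
        return {i: v / self.s0 for i, v in self.s1.items()}

    def radius(self):
        """Standard deviation per dimension: sqrt(s0*s2 - s1^2) / s0."""
        return {
            i: math.sqrt(max(self.s0 * self.s2[i] - v * v, 0.0)) / self.s0
            for i, v in self.s1.items()
        }
```

Observing {0: 1.0, 5: 2.0} and then {0: 3.0, 9: 4.0} leaves s1 with three non-zero entries even though each input has only two, which is the densification effect described in the list above.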
> On 9/26/12 2:57 AM, paritosh ranjan wrote:
>
> On Wed, Sep 26, 2012 at 4:07 AM, Adair Kovac <adairkovac@gmail.com> wrote:
>
>
> Hi folks, I'm running Mahout 0.7 and using the clustering command-line
> tools. The problem is that, so far, kmeans is the only one I can get to
> supply useful information on my data set and small (3-node) cluster.
>
>
> canopy either groups everything that isn't a starter point into one cluster
> or gets GC out-of-memory errors. The "either" is based on my fiddling with t
> values and MAHOUT_HEAPSIZE.
>
>
> Values of t1 and t2 can also play a role here. You can adjust t2 upward
> and that will reduce the number of canopies produced, which might help in
> getting rid of memory issues.
>
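To make the effect of the thresholds concrete, here is a simplified single-pass canopy sketch. It is an illustration under stated assumptions, not Mahout's implementation: canopy_centers and euclidean are hypothetical names, a point joins every canopy within t1 of it, and a point within t2 of any existing canopy center does not start a new canopy.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def canopy_centers(points, t1, t2, distance=euclidean):
    """Simplified canopy pass; returns a list of (center, members) pairs."""
    if t1 < t2:
        raise ValueError("t1 must be >= t2")
    canopies = []
    for p in points:
        strongly_bound = False
        for center, members in canopies:
            d = distance(p, center)
            if d < t1:
                members.append(p)        # loosely bound: joins this canopy
            if d < t2:
                strongly_bound = True    # too close to seed a new canopy
        if not strongly_bound:
            canopies.append((p, [p]))
    return canopies


points = [(0,), (1,), (2,), (3,), (10,), (11,)]
for t in (0.5, 1.5, 20.0):
    print(t, len(canopy_centers(points, t, t)))
```

On this toy data, raising t1 = t2 from 0.5 to 1.5 to 20 takes the canopy count from one per point down to a single canopy, so both failure modes described above (too many canopies to fit in memory, or everything in one cluster) sit at opposite ends of the same knob.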
>
>
> fkmeans throws Java heap space errors, even after I reduced my vector set
> to a whopping 24.0 MB (trying for 100 clusters).
>
>
> The Fuzziness constraint might be too fuzzy. You can try with a stricter
> one and loosen it step by step to find the breaking point.
>
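As a rough illustration of what tightening the fuzziness does, the sketch below computes textbook fuzzy c-means membership weights for one point from its distances to the cluster centers. It assumes "stricter" means a fuzziness value m closer to 1; memberships is a hypothetical helper, not a Mahout function.

```python
def memberships(distances, m):
    """Textbook fuzzy c-means weights for one point; requires m > 1."""
    if m <= 1.0:
        raise ValueError("m must be greater than 1")
    # A point lying exactly on one or more centers belongs fully to them.
    zeros = [d == 0.0 for d in distances]
    if any(zeros):
        n = sum(zeros)
        return [1.0 / n if z else 0.0 for z in zeros]
    # Naive form: the power can overflow when m is extremely close to 1.
    exponent = 2.0 / (m - 1.0)
    return [
        1.0 / sum((d_i / d_k) ** exponent for d_k in distances)
        for d_i in distances
    ]


for m in (2.0, 1.5, 1.1):
    print(m, memberships([1.0, 2.0], m))
```

The weights always sum to 1, and as m approaches 1 nearly all of the weight goes to the nearest center, so the assignment behaves more and more like plain k-means.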
>
>
> clusterdump similarly curls up and dies (heap space errors) when I try to
> get it to dump all (or much more than 500 per cluster) of the clustered
> points at the end of my kmeans algorithm.
>
>
> Try the clusterpp command; it does not have any memory problems:
> https://cwiki.apache.org/confluence/display/MAHOUT/Top+Down+Clustering
>
> kmeans took over 10 hours to run on 228.3MB of vectors, hitting the max
> iterations of 10. (Right now I'm running it on a 969.0MB vector file,
> hopefully it'll finish successfully.)
>
>
>
> If I understood correctly, the cluster currently has only 3 nodes; adding
> more nodes may make it faster.
> KMeans is by nature an iterative algorithm. One option is to find canopies
> first and then run fewer KMeans iterations: if the initial clusters are
> good, the quality will still be good, and this can significantly reduce the
> total run time.
>
>
>
>
> I'm using small text documents, so the number + sparsity might be the
> problem.
>
>
>
> Yes, might be.
>
>
>
> Are these issues unusual? Any advice on resolving them? Most of the google
> hits for similar issues just suggest setting MAHOUT_HEAPSIZE to 2048.
>
>
> Some tuning always helps to run it properly.
>
>
>
>

Regards,
Rahul K Mishra,
http://www.ee.iitb.ac.in/student/~rahulkmishra
