You did not mention the heap size configured on your cluster. As you
work on this problem, consider:
* In all of the algorithms, all clusters are retained in memory by the
mappers and reducers
* Each cluster holds 4 sparse vectors internally (center, radius, s1 & s2)
* Vectors tend to become more dense as iterations progress due to
summation of input vectors
* FuzzyK is the worst offender since it assigns every point to every
cluster, with a weight, in each iteration
* Adjust T1=T2 until you get a reasonable number of clusters using Canopy
* Text problems usually generate very wide, sparse vectors, but the
cluster vectors become denser with each iteration, as noted above
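
To put rough numbers on the points above, here is a minimal
back-of-the-envelope sketch in Python. It is not Mahout code: the
function name, the 12 bytes per non-zero entry (a 4-byte int index plus
an 8-byte double) and the example figures are all assumptions for
illustration, and real JVM object overhead will push the totals higher.

```python
def estimate_cluster_bytes(num_clusters, dimensions, density,
                           vectors_per_cluster=4, bytes_per_entry=12):
    """Rough memory held by all clusters in one mapper or reducer.

    Assumptions (not taken from Mahout): every non-zero entry costs
    bytes_per_entry bytes and per-object overhead is ignored.
    """
    nonzeros_per_vector = round(dimensions * density)
    return (num_clusters * vectors_per_cluster
            * nonzeros_per_vector * bytes_per_entry)


# Hypothetical figures: 100 clusters over a 100,000-term vocabulary,
# with the cluster vectors getting denser as iterations progress.
for density in (0.001, 0.1, 1.0):
    mb = estimate_cluster_bytes(100, 100_000, density) / 1e6
    print(f"density {density}: about {mb:,.1f} MB")
```

The point is only the trend: the same 100 clusters cost a few hundred
times more memory once their vectors fill in, which is why a heap that
survives the first iteration can still fail on a later one.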
On 9/26/12 2:57 AM, paritosh ranjan wrote:
> On Wed, Sep 26, 2012 at 4:07 AM, Adair Kovac <adairkovac@gmail.com> wrote:
>
>> Hi folks, I'm running Mahout 0.7 and using the clustering commandline
>> tools. Problem is, the only one I can get to supply useful information on
>> my data set and small (3node) cluster is kmeans, so far.
>>
>> canopy either groups everything that isn't a starter point into one cluster
>> or gets GC out of memory errors. The "either" is based on my fiddling with t
>> values and MAHOUT_HEAPSIZE.
>>
> Values of t1 and t2 can also play a role here. You can adjust t2 upward
> and that will reduce the number of canopies produced, which might help in
> getting rid of memory issues.
>
>
>> fkmeans throws Java heap space errors, even after I reduced my vectors set
>> to a whopping 24.0 MB (trying for 100 clusters).
>>
> The Fuzziness constraint might be too fuzzy. You can try with a stricter
> one and loosen it step by step to find the breaking point.
>
>
>> clusterdump similarly curls up and dies (heap space errors) when I try to
>> get it to dump all (or much more than 500 per cluster) of the clustered
>> points at the end of my kmeans algorithm.
>>
> Try the clusterpp command; it does not have these memory problems.
> https://cwiki.apache.org/confluence/display/MAHOUT/Top+Down+Clustering
>
>
>> kmeans took over 10 hours to run on 228.3MB of vectors, hitting the max
>> iterations of 10. (Right now I'm running it on a 969.0MB vector file,
>> hopefully it'll finish successfully.)
>>
>>
> The cluster currently has only 3 nodes, if I understood correctly; adding
> more nodes may make it faster.
> KMeans is by nature a multiple-iteration algorithm. One option is to find
> canopies first and then run fewer KMeans iterations: if the initial
> clusters are good, the quality will be good as well, and this can
> significantly reduce the total run time.
>
>
>
>> I'm using small text documents, so the number + sparsity might be the
>> problem.
>>
>>
> Yes, might be.
>
>
>> Are these issues unusual? Any advice on resolving them? Most of the google
>> hits for similar issues just suggest setting MAHOUT_HEAPSIZE to 2048.
>>
> Some tuning always helps to run it properly.
>
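
The quoted suggestion that raising t2 reduces the number of canopies can
be illustrated with a small sketch. This is a simplified single-pass
version written in Python, not Mahout's implementation: the function
name, the Euclidean distance and the toy 1-D points are assumptions. T1
is left out because in this sketch it would only decide which points
join an existing canopy, not how many canopies are created.

```python
import math


def canopy_centers(points, t2, distance=math.dist):
    """Return the points that end up as canopy centers.

    A point starts a new canopy only if it is not within t2 of any
    existing center (simplified rule, assumed for illustration).
    """
    centers = []
    for p in points:
        if not any(distance(p, c) < t2 for c in centers):
            centers.append(p)
    return centers


# Toy data: ten 1-D points spaced one unit apart.
points = [(float(i),) for i in range(10)]
for t2 in (0.5, 2.5, 20.0):
    print(f"t2={t2}: {len(canopy_centers(points, t2))} canopies")
```

Fewer canopies means fewer cluster vectors held in memory, at the cost
of coarser initial clusters, so t2 is worth tuning before raising the
heap size.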
