mahout-user mailing list archives

From Adair Kovac <adairko...@gmail.com>
Subject Averting the (clustering) heapocalypse?
Date Tue, 25 Sep 2012 22:37:29 GMT
Hi folks, I'm running Mahout 0.7 and using the command-line clustering
tools. The problem is that, so far, kmeans is the only one that gives useful
results on my data set and small (3-node) cluster.

canopy either groups everything that isn't a starting point into one cluster
or fails with GC out-of-memory errors. Which of the two happens depends on
how I set the t values and MAHOUT_HEAPSIZE.

fkmeans throws Java heap space errors, even after I reduced my vector set
to a whopping 24.0 MB (trying for 100 clusters).

clusterdump similarly curls up and dies (heap space errors) when I try to
get it to dump all of the clustered points (or much more than 500 per
cluster) at the end of my kmeans run.

kmeans took over 10 hours to run on 228.3 MB of vectors, hitting the maximum
of 10 iterations. (Right now I'm running it on a 969.0 MB vector file;
hopefully it'll finish successfully.)

I'm using small text documents, so the number of documents plus the
sparsity of their vectors might be the problem.
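
A rough back-of-envelope sketch of why sparsity could matter, assuming (and
this is only an assumption) that cluster centroids end up stored as dense
vectors of 8-byte doubles even though the input documents are sparse. The
function name and the 500,000-term vocabulary are hypothetical; only the 100
clusters come from the fkmeans attempt above.

```python
def dense_centroid_bytes(num_clusters: int, dimensions: int,
                         bytes_per_value: int = 8) -> int:
    """Bytes needed to hold every centroid as a dense array of doubles."""
    return num_clusters * dimensions * bytes_per_value


# 100 clusters as in the fkmeans attempt; 500,000 terms is an assumed
# vocabulary size, not a number taken from the actual data set.
estimate = dense_centroid_bytes(100, 500_000)
print(f"{estimate / 2**20:.1f} MiB for centroids alone")
```

If that assumption holds, the heap needed would scale with clusters times
vocabulary size rather than with the size of the vector file on disk.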

Are these issues unusual? Any advice on resolving them? Most of the Google
hits for similar issues just suggest setting MAHOUT_HEAPSIZE to 2048.
