mahout-user mailing list archives

From Suneel Marthi <>
Subject Re: Avoiding OOM for large datasets
Date Wed, 04 Dec 2013 18:27:22 GMT

This has been reported before by several others (and has been my experience too). The OOM
happens during the Canopy Generation phase of Canopy clustering because it only runs with a single

If you are using Mahout 0.8 (or trunk), I suggest you look at the new Streaming KMeans clustering,
which is quicker and more efficient than the traditional Canopy -> KMeans.

See the following link for how to run Streaming KMeans.

On Wednesday, December 4, 2013 1:19 PM, Amir Mohammad Saied <> wrote:

I've been trying to run Mahout (with Hadoop) on our data for quite some time
now. Everything is fine on relatively small data sets, but when I try to do
K-Means clustering with the aid of Canopy on about 300,000 documents, I can't
even get past the canopy generation because of OOM. We're going to cluster
similar news, so T1 and T2 are set to 0.84 and 0.6 (those values lead to the
desired results on sample data).
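
To illustrate how the two thresholds interact, here is a minimal Python sketch of single-pass canopy generation. It is not Mahout's actual implementation: the function names, the cosine distance measure, and the sample points are assumptions made for illustration only. The point it shows is that every canopy created so far stays in one in-memory list, and a smaller T2 tends to create more canopies.

```python
import math


def cosine_distance(a, b):
    """Cosine distance between two non-zero vectors (assumed measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def generate_canopies(points, t1, t2, distance=cosine_distance):
    """Simplified canopy generation; expects t1 > t2.

    A point within t1 of a canopy center is counted as a member of it.
    A point within t2 of any center is "strongly bound" and does not
    start a new canopy. All canopies are held in a single list, so
    memory use grows with the number of canopies.
    """
    canopies = []  # hypothetical structure: {"center": vector, "count": int}
    for point in points:
        strongly_bound = False
        for canopy in canopies:
            d = distance(point, canopy["center"])
            if d < t1:
                canopy["count"] += 1
            if d < t2:
                strongly_bound = True
        if not strongly_bound:
            canopies.append({"center": point, "count": 1})
    return canopies


# Made-up sample vectors, using the T1 and T2 values from this message.
sample = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0)]
canopies = generate_canopies(sample, t1=0.84, t2=0.6)
```

With these sample points, the second vector falls within T2 of the first and joins its canopy, while the third is farther than T1 from it and starts a new canopy.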

I tried setting both "" and
"" to "-Xmx4096M". I also
exported HADOOP_HEAPSIZE as 4000, and I'm still having issues.

I'm running all of this in Hadoop's single-node, pseudo-distributed mode on
a machine with 16GB of RAM.

Searching the Internet for solutions, I found this[1]. One of the bullet points
states that:

    "In all of the algorithms, all clusters are retained in memory by the
mappers and reducers"

So my question is, does Mahout on Hadoop only help in distributing CPU-bound
operations? What should one do if they have a large dataset and only
a handful of low-RAM commodity nodes?

I'm obviously a newbie, thanks for bearing with me.


