mahout-user mailing list archives

From Paritosh Ranjan <pran...@xebia.com>
Subject Re: KMeans job fails during 2nd iteration. Java Heap space
Date Wed, 08 Aug 2012 09:40:28 GMT
A stack trace would have helped in pinpointing the exact error.

However, a large number of clusters can create heap space problems (especially if
the vector dimension is also high).
Either try to reduce the number of initial clusters (in my opinion, the best way
to determine the initial clusters is Canopy Clustering:
https://cwiki.apache.org/confluence/display/MAHOUT/Canopy+Clustering)

or try to reduce the dimension of the vectors. A sketch of the canopy approach follows.
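For illustration, a minimal sketch of the canopy-then-kmeans pipeline. The output
paths and the t1/t2 thresholds below are made-up values you would need to tune to
your data, and I am assuming cosine distance since these are tf-idf vectors:

------------------------------------------------
# 1) Estimate initial centroids with canopy (t1/t2 are hypothetical values)
mahout canopy \
-i ...tfidf-vectors/ \
-o /tmp/canopy-centroids/ \
-dm org.apache.mahout.common.distance.CosineDistanceMeasure \
-t1 0.5 -t2 0.3 \
-ow

# 2) Feed the canopy centroids to kmeans; when -c points at existing clusters,
#    -k/--numClusters is not needed (the final clusters directory name, e.g.
#    clusters-0-final, depends on your Mahout version)
mahout kmeans \
-i ...tfidf-vectors/ \
-c /tmp/canopy-centroids/clusters-0-final \
-o /tmp/clustering_results_kmeans/ \
--overwrite \
--clustering
------------------------------------------------

To reduce the vector dimension instead, the pruning options of seq2sparse
(e.g. --maxDFPercent and --minSupport) can cut the 200 000-term dictionary
down considerably.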

PS: you are also providing --numClusters twice:

--numClusters 1000 \
--numClusters 5 \
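Pass the flag only once. Assuming 1000 is the intended cluster count:

------------------------------------------------
mahout kmeans -Dmapred.reduce.tasks=200 \
-i ...tfidf-vectors/ \
-o /tmp/clustering_results_kmeans/ \
--clusters /tmp/clusters/ \
--numClusters 1000 \
--overwrite \
--clustering
------------------------------------------------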

On 08-08-2012 10:42, Abramov Pavel wrote:
> Hello,
>
> I am trying to run the KMeans example on 15 000 000 documents (seq2sparse output).
> There are 1 000 clusters, a 200 000-term dictionary, and documents of 3-10 terms
> (titles). seq2sparse produces 200 files of 80 MB each.
>
> My job failed with a Java heap space error. The 1st iteration passes while the 2nd iteration fails.
> In the map phase of buildClusters I see a lot of warnings, but it passes. The reduce phase of
> buildClusters fails with "Java heap space".
>
> I cannot increase reducer/mapper memory in Hadoop. My cluster is tuned well.
>
> How can I avoid this situation? My cluster has 300 mappers and 220 reducers running on 40
> servers, each with 8 cores and 12 GB RAM.
>
> Thanks in advance!
>
> Here are the KMeans parameters:
>
> ------------------------------------------------
> mahout kmeans -Dmapred.reduce.tasks=200 \
> -i ...tfidf-vectors/  \
> -o /tmp/clustering_results_kmeans/ \
> --clusters /tmp/clusters/ \
> --numClusters 1000 \
> --numClusters 5 \
> --overwrite \
> --clustering
> ------------------------------------------------
>
> Pavel


