mahout-user mailing list archives

From Jure Jeseničnik <>
Subject Clustering performance
Date Thu, 02 Dec 2010 14:32:56 GMT
I have already explained my mission here:

Using trial & error, I've managed to find the most appropriate input parameters
for canopy: T1=1.4, T2=1.2. This gives me around 7000 clusters from 7800 input
documents, which is exactly the result I've been looking for. I'm trying to group
together news items from different sources that talk about the same story.
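For anyone wondering how T1 and T2 interact here, this is a minimal single-machine sketch of the standard canopy algorithm (not Mahout's actual code): points within the tight threshold T2 of a chosen center are removed from further consideration, while points within the loose threshold T1 join the canopy. A T2 close to T1, as above, removes few points per pass and so yields many small canopies.

```python
def canopy(points, t1, t2, dist):
    """Naive canopy clustering sketch. Requires t1 >= t2.

    Each pass picks a remaining point as a canopy center, adds every
    point within t1 to that canopy, and removes points within t2 from
    the candidate list so they cannot seed another canopy.
    """
    candidates = list(points)
    canopies = []
    while candidates:
        center = candidates.pop(0)
        members = [center]
        remaining = []
        for p in candidates:
            d = dist(center, p)
            if d < t1:
                members.append(p)   # loose threshold: joins this canopy
            if d >= t2:
                remaining.append(p)  # outside tight threshold: stays available
        candidates = remaining
        canopies.append((center, members))
    return canopies

# Illustration with 1-D points and absolute-difference distance:
result = canopy([0.0, 0.1, 0.2, 5.0, 5.1], t1=1.0, t2=0.5,
                dist=lambda a, b: abs(a - b))
```

Note the cost is quadratic in the number of points per pass, which is one reason a near-equal T1/T2 pair over thousands of documents gets slow.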
What bothers me now is the performance. To process a 3.6 MB file on my fairly
decent 4-core desktop machine, Mahout needs a good 14 minutes. I know I'm dealing
with a pretty large number of clusters, but still, 14 minutes is a huge amount of time.
If I use a smaller amount of data (1700 docs), it's all done in under a minute.
When running locally, Mahout was only using one CPU core. I'm running it on Windows 7
through Cygwin, but it behaved much the same on some proper Linux machines. How could
I make it use all the available CPU power?
I also tried running this on a Hadoop cluster, but there seemed to be no significant
improvement in time. It seemed like Hadoop was unable to properly distribute such a
small job. Is it possible that I missed something here? What can I do to finish this
clustering in a more reasonable time?

Thank you for your answers.


Planet 9 d.o.o.
Vojkova 78
1000 Ljubljana
Jure Jeseničnik
Razvijalec aplikacij / Applications developer
T + 386 47 30 375
F + 386 1 47 28 550
M + 386 41 363 586
