mahout-user mailing list archives

From Jure Jeseničnik <Jure.Jesenic...@planet9.si>
Subject RE: Clustering performance
Date Fri, 03 Dec 2010 06:21:06 GMT
How can I tell whether the file is splittable? If it is not, how do I make it splittable?

Regards.

Jure

-----Original Message-----
From: Ted Dunning [mailto:ted.dunning@gmail.com] 
Sent: Thursday, December 02, 2010 4:49 PM
To: user@mahout.apache.org
Subject: Re: Clustering performance

How many maps does Hadoop schedule?  If the number is small, then you need
to decrease the split size and make sure that your input file is splittable.
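
As a rough illustration of why the split size matters here, the number of map tasks is approximately the input size divided by the split size. The sketch below is plain arithmetic, not a Hadoop API call. The 64 MB split size is an assumption (a common default block size, not a value stated in this thread), and the function names are hypothetical. Hadoop's real split calculation has extra rules, so treat the result as an estimate.

```python
import math

MB = 1024 * 1024

def estimated_map_tasks(file_bytes, split_bytes):
    """Approximate number of map tasks: one per split, at least one."""
    return max(1, math.ceil(file_bytes / split_bytes))

def split_size_for(file_bytes, wanted_maps):
    """Largest split size (in bytes) that still yields about wanted_maps splits."""
    return math.ceil(file_bytes / wanted_maps)

# The 3.6 MB input file mentioned in the thread.
file_bytes = int(3.6 * MB)

# Assumption: a 64 MB split size, so the whole file fits in a single split.
default_maps = estimated_map_tasks(file_bytes, 64 * MB)

# Split size needed to keep roughly four map tasks busy (one per core).
four_core_split = split_size_for(file_bytes, 4)
four_core_maps = estimated_map_tasks(file_bytes, four_core_split)

print(default_maps, four_core_split, four_core_maps)
```

Under these assumptions the whole file lands in one split, which would explain a single busy core. A split size a little under 1 MB would be needed for about four maps, and only if the input format is splittable.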

2010/12/2 Jure Jeseničnik <Jure.Jesenicnik@planet9.si>

> I have already explained my mission here:
>
>
> http://mail-archives.apache.org/mod_mbox/mahout-user/201011.mbox/%3C0EDE11E319B0B043B4F24E0305CABF7C80413134A4@P9MAIL.p9.internal%3E
>
>
>
> By trial and error I've managed to find the most appropriate input
> parameters for canopy: T1=1.4 and T2=1.2. This gives me somewhere around
> 7000 clusters from 7800 input documents, which is exactly the result I've
> been looking for. I'm trying to group news items from different sources
> that cover the same story.
>
> What bothers me now is the performance. To process a 3.6 MB file on my
> fairly decent 4-core desktop machine, Mahout needs a good 14 minutes. I
> know I'm dealing with a pretty large number of clusters, but still, 14
> minutes is a huge amount of time. If I use a smaller amount of data (1700
> docs), it is all over in less than a minute.
>
> When running locally, Mahout was only using one CPU core. I'm running it
> on Windows 7 through Cygwin, but it behaved much the same on some proper
> Linux machines. How can I make it use all the available CPU power?
>
> I also tried running this on a Hadoop cluster, but there was no
> significant improvement in time. It seemed that Hadoop was unable to
> distribute such a small task properly.
>
> Is it possible that I missed something here? What can I do to have this
> clustering finish in a more reasonable time?
>
>
>
> Thank you for your answers.
>
>
>
> Jure
>
> *Planet 9 d.o.o.*
> Vojkova 78
> 1000 Ljubljana
> Slovenija
> -
> *Jure Jeseničnik*
> Razvijalec aplikacij / Applications developer
> jure.jesenicnik@planet9.si <jure.jesenicnikk@planet9.si>
> *T* + 386 47 30 375
> *F* + 386 1 47 28 550
> *M* + 386 41 363 586
>
>
>
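
The timings quoted above (under a minute for 1700 documents, about 14 minutes for 7800) are roughly what one would expect if the work grows with the number of documents times the number of canopies. The sketch below is a back-of-the-envelope estimate, not a measurement of Mahout itself. It assumes the canopy-to-document ratio seen on the full set (about 7000 from 7800) also holds for the smaller set, and the function name is hypothetical.

```python
def distance_comparisons(num_docs, canopies_per_doc):
    """Rough upper bound: every document is compared against every canopy."""
    num_canopies = num_docs * canopies_per_doc
    return num_docs * num_canopies

# Ratio observed in the thread: about 7000 canopies from 7800 documents.
canopies_per_doc = 7000 / 7800

small = distance_comparisons(1700, canopies_per_doc)
large = distance_comparisons(7800, canopies_per_doc)

# With a fixed canopy-to-document ratio this reduces to (7800 / 1700) ** 2.
work_ratio = large / small
print(round(work_ratio, 1))
```

With T1 and T2 chosen so that nearly every document becomes its own canopy, the estimated work grows roughly quadratically. That makes the larger run about 21 times more expensive than the smaller one under these assumptions.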