mahout-user mailing list archives

From "Fuhrmann Alpert, Galit" <galp...@ebay.com>
Subject RE: mahout kmeans not generating clusteredPoint dir?
Date Wed, 17 Jul 2013 09:42:39 GMT

Thanks Suneel.
I tried adding this flag (though I thought the clusteredPoints directory was supposed to be
created by default?).
Either way, for some reason whenever I add '-cl' (tried to run it on several data sets), I
get the following error: 
"There is no queue named default"
(even though I do specify a queue by -Dmapred.job.queue.name=...).
I don't get this error otherwise.
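
For reference, a sketch of the kind of call described above, reusing the paths from the original question quoted below. The queue name `myqueue` is a placeholder, and putting the -D option ahead of the job-specific options is an assumption based on the usual Hadoop generic-options convention, not something confirmed in this thread. The snippet only assembles and prints the command so it can be reviewed before it is run.

```shell
# Placeholder queue name (hypothetical); replace with a queue that exists on your cluster.
QUEUE="myqueue"

# Same arguments as the original call, plus the queue property and the '-cl' flag.
CMD="mahout kmeans -Dmapred.job.queue.name=$QUEUE -i inputSeq.dat -o outputPath -k 4000 --maxIter 3 --clusters outputSeeds -cl"

# Print the assembled command for review instead of executing it.
echo "$CMD"
```

Once the printed command looks right, it can be pasted into the shell or run with `eval "$CMD"`.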

Has anyone ever encountered this error?
Is there some sort of configuration I'm missing?

Thanks,

Galit.

-----Original Message-----
From: Suneel Marthi [mailto:suneel_marthi@yahoo.com] 
Sent: Wednesday, July 10, 2013 5:30 PM
To: user@mahout.apache.org
Subject: Re: mahout kmeans not generating clusteredPoint dir?

It's been a while since I last worked with this, but I believe you are missing the clustering
option '-cl'. Give that a try.




________________________________
 From: "Fuhrmann Alpert, Galit" <galpert@ebay.com>
To: "user@mahout.apache.org" <user@mahout.apache.org> 
Sent: Wednesday, July 10, 2013 5:17 AM
Subject: mahout kmeans not generating clusteredPoint dir?
 

Hello,

I ran mahout kmeans (using random seeds) on a Hadoop cluster. It ran successfully and created
a directory containing clusters-*, including the last one, clusters-3-final.
However, it did not create the clusteredPoints directory, or at least I cannot find it under
the same dir (or anywhere else).

My call was:
mahout kmeans -k 4000 -i inputSeq.dat -o outputPath --maxIter 3 --clusters outputSeeds

Was there an extra argument I needed to specify in order for it to generate the clusteredPoints?
(BTW I also can't see the outputSeeds. Was it created for seeds and then deleted?)

According to Mahout in Action:

The k-means clustering implementation creates two types of directories in the output
folder. The clusters-* directories are formed at the end of each iteration: the clusters-0
directory is generated after the first iteration, clusters-1 after the second iteration, and
so on. These directories contain information about the clusters: centroid, standard
deviation, and so on. The clusteredPoints directory, on the other hand, contains the
final mapping from cluster ID to document ID. This data is generated from the output
of the last MapReduce operation.
The directory listing of the output folder looks something like this:
$ ls -l reuters-kmeans-clusters
drwxr-xr-x 4 user 5000 136 Feb 1 18:56 clusters-0
drwxr-xr-x 4 user 5000 136 Feb 1 18:56 clusters-1
drwxr-xr-x 4 user 5000 136 Feb 1 18:56 clusters-2
...
drwxr-xr-x 4 user 5000 136 Feb 1 18:59 clusteredPoints

Again, my call did not generate the clusteredPoints directory.
I would appreciate your help.

Thanks a lot,

Galit.
