mahout-user mailing list archives

From "Fuhrmann Alpert, Galit" <galp...@ebay.com>
Subject help with mahout clustering on hadoop
Date Wed, 24 Jul 2013 08:17:19 GMT

Hi Everyone,

I see this is an active list, so I'm trying again to reach out for help using Mahout on HDFS.
I am a research scientist at eBay, working on big-data analysis for e-commerce. I've been
trying to run Mahout on my data for quite some time now; it runs locally with no problem, but
I'm having trouble running it on HDFS.
I hope you have some leads on the following (I've accumulated quite a few unresolved issues):

1. I'm trying again to see if anyone has an answer to this:
	I've been running mahout kmeans successfully on HDFS; however, if I run it without the -cl flag, the clusteredPoints directory is not created.
	Whenever I add '-cl' to my call, I get the error "There is no queue named default", even though I do specify a queue via -Dmapred.job.queue.name.
	I do not get this error if I leave out -cl; the job runs just fine (though without creating the clusteredPoints directory).
	Does anyone have an idea why this happens? (My full call is sketched after this list.)
2. My Mahout clustering jobs run very slowly (a good several hours on just ~1M items), and I'm wondering whether something in my settings/configuration needs to change (and how).
	I'm running on large clusters and could potentially use thousands of nodes; however, my Mahout jobs (kmeans/canopy) only ever use at most 5 mappers (I tried several data sets).
	I tried to set the number of mappers with something like -Dmapred.map.tasks=100, but this seemed to have no effect; the jobs still use <=5 mappers.
	Is there a different way to set the number of mappers/reducers for a Mahout job? Or is there another configuration issue I need to consider? (My attempt is sketched after this list.)
3. When running Mahout canopy clustering, the jobs consistently fail with out-of-memory errors such as:
	attempt_201306241658_137502_m_000001_1: Exception in thread "Thread for syncLogs" java.lang.OutOfMemoryError:
Java heap space
	and finally:
	Exception in thread "main" java.lang.InterruptedException: Canopy Job failed processing whateverfilename.dat
	Even though the file does exist.
	I tried to increase the map/reduce memory with -Dmapred.child.java.opts=-Xmx4g, but it still fails (my canopy call is sketched after this list):
		13/07/22 01:56:09 INFO mapred.JobClient:   Job Counters
		13/07/22 01:56:09 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=23121
		13/07/22 01:56:09 INFO mapred.JobClient:     Total time spent by all reduces waiting after
reserving slots (ms)=0
		13/07/22 01:56:09 INFO mapred.JobClient:     Total time spent by all maps waiting after
reserving slots (ms)=0
		13/07/22 01:56:09 INFO mapred.JobClient:     Launched map tasks=13
		13/07/22 01:56:09 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=0
		13/07/22 01:56:09 INFO mapred.JobClient:     Failed map tasks=1
		Exception in thread "main" java.lang.InterruptedException: Canopy Job failed processing
whateverfilename.dat
		        at org.apache.mahout.clustering.canopy.CanopyDriver.buildClustersMR(CanopyDriver.java:363)
		        at org.apache.mahout.clustering.canopy.CanopyDriver.buildClusters(CanopyDriver.java:248)
		        at org.apache.mahout.clustering.canopy.CanopyDriver.run(CanopyDriver.java:155)
		        at org.apache.mahout.clustering.canopy.CanopyDriver.run(CanopyDriver.java:117)
		        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
		        at org.apache.mahout.clustering.canopy.CanopyDriver.main(CanopyDriver.java:64)
		        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
		        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
		        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
		        at java.lang.reflect.Method.invoke(Method.java:597)
		        at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
		        at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
		        at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
		        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
		        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
		        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
		        at java.lang.reflect.Method.invoke(Method.java:597)
		        at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
4. One of the first major problems I encountered was that a Mahout jar we created that uses KMeansDriver (and that runs fine on my local machine) did not even initiate a job on the Hadoop cluster. It appeared to be running in parallel, but in fact it was running only on the local node.
	Has this happened to anyone? If so, what is the fix? (I ended up dropping it and calling mahout step by step from the command line, but I'd be happy to know if there is a fix; the way I launched it is sketched after this list.)
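
For reference on question 1, my kmeans call looks roughly like this (the queue name is a placeholder, and as far as I understand the -D options have to come before the Mahout-specific flags so the generic option parser picks them up):

	mahout kmeans \
	  -Dmapred.job.queue.name=myqueue \
	  -k 4000 -i inputSeq.dat -o outputPath \
	  --maxIter 3 --clusters outputSeeds \
	  -cl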
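
On question 2, this is how I have been trying to raise the mapper count (queue name is again a placeholder, and 16777216 = 16 MB is just an illustrative value). Since mapred.map.tasks appears to be only a hint, I'm guessing the mapper count is really driven by the number of input splits of my single input sequence file, so perhaps a split-size limit such as mapred.max.split.size is what I should be setting instead?

	mahout kmeans \
	  -Dmapred.job.queue.name=myqueue \
	  -Dmapred.map.tasks=100 \
	  -Dmapred.max.split.size=16777216 \
	  -k 4000 -i inputSeq.dat -o outputPath \
	  --maxIter 3 --clusters outputSeeds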
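
On question 3, my canopy call is roughly the following (the queue name, output path, distance measure, and t1/t2 thresholds are placeholders). Could it simply be that my t1/t2 values produce so many canopies that the mapper can no longer hold them all in memory, even at -Xmx4g?

	mahout canopy \
	  -Dmapred.job.queue.name=myqueue \
	  -Dmapred.child.java.opts=-Xmx4g \
	  -i inputSeq.dat -o canopyOutput \
	  -dm org.apache.mahout.common.distance.EuclideanDistanceMeasure \
	  -t1 0.5 -t2 0.3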
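
And on question 4, this is roughly how I was launching our KMeansDriver-based jar (the jar name, class name, and config path are placeholders, and the -D option assumes the main class goes through ToolRunner). My current suspicion is that the cluster's *-site.xml files were not on the classpath, so the job fell back to the local runner; I would also double-check that the runSequential argument passed to KMeansDriver.run() is false. Does that sound right?

	export HADOOP_CONF_DIR=/etc/hadoop/conf
	hadoop jar ourClusteringJob.jar com.example.OurKMeansJob \
	  -Dmapred.job.queue.name=myqueue \
	  inputSeq.dat outputPath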

Any ideas/input on any of these issues would be greatly appreciated.

Thanks!

Galit.

-----Original Message-----
From: Fuhrmann Alpert, Galit 
Sent: Wednesday, July 17, 2013 12:43 PM
To: user@mahout.apache.org; 'Suneel Marthi'
Subject: RE: mahout kmeans not generating clusteredPoint dir?


Thanks Suneel.
I tried adding this flag (though I thought the clusteredPoints directory was supposed to be created by default?).
Either way, for some reason whenever I add '-cl' (I tried it on several data sets), I get the following error:
"There is no queue named default"
(even though I do specify a queue by -Dmapred.job.queue.name=...).
I don't get this error otherwise.

Has anyone ever encountered this error?
Is there some sort of configuration I'm missing?

Thanks,

Galit.

-----Original Message-----
From: Suneel Marthi [mailto:suneel_marthi@yahoo.com] 
Sent: Wednesday, July 10, 2013 5:30 PM
To: user@mahout.apache.org
Subject: Re: mahout kmeans not generating clusteredPoint dir?

It's been a while since I last worked with this, but I believe you are missing the clustering option '-cl'.
Give that a try.
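
In other words, something like this (your original call from the message below, with -cl appended):

	mahout kmeans -k 4000 -i inputSeq.dat -o outputPath --maxIter 3 --clusters outputSeeds -cl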




________________________________
 From: "Fuhrmann Alpert, Galit" <galpert@ebay.com>
To: "user@mahout.apache.org" <user@mahout.apache.org> 
Sent: Wednesday, July 10, 2013 5:17 AM
Subject: mahout kmeans not generating clusteredPoint dir?
 

Hello,

I ran mahout kmeans (using random seeds) on a Hadoop cluster. It ran successfully and created
a directory containing clusters-*, the last of which was clusters-3-final.
However, it did not create clusteredPoints, or at least I cannot find it under the same
directory (or anywhere else).

My call was:
mahout kmeans -k 4000 -i inputSeq.dat -o outputPath --maxIter 3 --clusters outputSeeds

Was there an extra argument I needed to specify in order for it to generate the clusteredPoints?
(BTW I also can't see the outputSeeds. Was it created for seeds and then deleted?)

According to mahout in action:

The k-means clustering implementation creates two types of directories in the output
folder. The clusters-* directories are formed at the end of each iteration: the clusters-0
directory is generated after the first iteration, clusters-1 after the second iteration, and
so on. These directories contain information about the clusters: centroid, standard
deviation, and so on. The clusteredPoints directory, on the other hand, contains the
final mapping from cluster ID to document ID. This data is generated from the output
of the last MapReduce operation.
The directory listing of the output folder looks something like this:
$ ls -l reuters-kmeans-clusters
drwxr-xr-x 4 user 5000 136 Feb 1 18:56 clusters-0
drwxr-xr-x 4 user 5000 136 Feb 1 18:56 clusters-1
drwxr-xr-x 4 user 5000 136 Feb 1 18:56 clusters-2
...
drwxr-xr-x 4 user 5000 136 Feb 1 18:59 clusteredPoints
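
(Once clusteredPoints does get generated, my plan was to inspect the final assignments with something along these lines; I'm not 100% sure of the exact flag names in my Mahout version:)

	mahout clusterdump \
	  -i outputPath/clusters-3-final \
	  -p outputPath/clusteredPoints \
	  -o clusters.txt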

Again, my call did not generate the clusteredPoints directory.
I would appreciate your help.

Thanks a lot,

Galit.
