mahout-user mailing list archives

From Jeff Eastman <j...@windwardsolutions.com>
Subject Re: Failure to run Clustering example
Date Mon, 11 May 2009 21:52:27 GMT
I don't see anything obviously canopy-related in the logs. Canopy 
serializes the vectors but the storage representation should not be too 
inefficient.

If T1 and T2 are too small relative to your observed distance measures 
you will get a LOT of canopies, potentially one per document. How many 
did you get in your run? That said, for 1000 vectors of 100 terms each, 
something does seem unusual here. I've run canopy (on a 12-node 
cluster) with millions of 30-element DenseVector input points and not 
seen these sorts of numbers. It is possible you are thrashing your RAM. 
Have you thought about getting an EC2 instance or two? I think we are 
currently OK with Elastic MapReduce too, but I have not tried that yet.
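
To make the T1/T2 point concrete, the canopy assignment loop is 
essentially the following (a hand-written sketch with made-up names, 
not the actual Mahout code). If T2 is below essentially all of your 
pairwise distances, no point is ever strongly bound, every point seeds 
a new canopy, and each new point is then compared against an 
ever-growing canopy list:

import java.util.ArrayList;
import java.util.List;

final class CanopySketch {

  // One canopy: a center plus the points that fell within T1 of it.
  static final class Canopy {
    final double[] center;
    final List<double[]> members = new ArrayList<double[]>();
    Canopy(double[] center) { this.center = center; members.add(center); }
  }

  // Plain Euclidean distance, matching the EuclideanDistanceMeasure in the example job.
  static double distance(double[] a, double[] b) {
    double sum = 0.0;
    for (int i = 0; i < a.length; i++) {
      double diff = a[i] - b[i];
      sum += diff * diff;
    }
    return Math.sqrt(sum);
  }

  static List<Canopy> buildCanopies(List<double[]> points, double t1, double t2) {
    List<Canopy> canopies = new ArrayList<Canopy>();
    for (double[] p : points) {
      boolean stronglyBound = false;
      for (Canopy c : canopies) {
        double d = distance(p, c.center);
        if (d < t1) {
          c.members.add(p);            // loosely bound: p also belongs to this canopy
        }
        if (d < t2) {
          stronglyBound = true;        // tightly bound: p will not seed a new canopy
        }
      }
      if (!stronglyBound) {
        canopies.add(new Canopy(p));   // with T2 below all observed distances this fires
                                       // for every point: one canopy per document
      }
    }
    return canopies;
  }
}

That makes the cost roughly points x canopies distance computations, so 
with one canopy per document it degenerates toward n^2, which on a 
single low-memory node could account for runtimes like yours.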

I would not expect the reducer to start until all the mappers are done.

I'm back stateside Wednesday from Oz and will be able to take a look 
later in the week. I also notice canopy still has the combiner problem 
we fixed in kMeans: it won't work if the combiner does not run. It's 
darned unfortunate there isn't an option to require the combiner. More 
to think about...
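
For anyone following along, the safe pattern is a combiner whose output 
has exactly the same shape as the map output, so the reducer works 
whether or not the combiner ever ran; Hadoop is free to run the 
combiner zero, one, or several times. Roughly (stock Hadoop classes as 
stand-ins, not the real canopy classes):

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.LongSumReducer;

// Sketch only: stand-in classes, not the canopy job's real mapper/combiner/reducer.
public class CombinerHintSketch {
  public static JobConf configure() {
    JobConf conf = new JobConf(CombinerHintSketch.class);
    conf.setMapperClass(IdentityMapper.class);
    conf.setCombinerClass(LongSumReducer.class);  // optional pre-aggregation, never guaranteed to run
    conf.setReducerClass(LongSumReducer.class);   // same logic again, so a skipped combiner is harmless
    return conf;
  }
}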

Jeff


Shashikant Kore wrote:
> On Wed, May 6, 2009 at 6:45 AM, Grant Ingersoll <gsingers@apache.org> wrote:
>   
>>> 2. To create canopies for 1000 documents it took almost 75 minutes.
>>> Though the total number of unique terms in the index is 50,000, each
>>> vector has fewer than 100 unique terms (i.e. each document vector is a
>>> sparse vector of cardinality 50,000 with about 100 non-zero elements).
>>> The hardware is admittedly "low-end": 1 GB of RAM and a 1.6 GHz
>>> dual-core processor. Hadoop has one node.  Values of T1 and T2 were 80
>>> and 55 respectively, as given in the sample program.
>>>       
>> Have you profiled it?  Would be good to see where the issue is coming from.
>>
>>     
>
> Apologies for getting back to this late.
>
> I ran clustering on 100 documents with the profiling flag in Hadoop set
> to true. The canopy mapper took an hour and the reducer took 32 minutes
> to generate these results.  The canopy clustering job is yet to finish.
> Here are the relevant outputs.
>
> Source: logs/userlogs/attempt_200905111521_0002_m_000000_0/profile.out  (Mapper)
> rank   self  accum     bytes objs     bytes  objs trace name
>     1 84.51% 84.51%  99614736    1  99614736     1 304249 byte[]
>     2  5.53% 90.05%   6522848 407678 3336600480 208537530 304697 java.lang.Integer
>     3  3.34% 93.38%   3932176    1   3932176     1 304252 int[]
>     4  3.03% 96.41%   3567216 222951 690373248 43148328 305480 java.lang.Integer
>     5  1.11% 97.52%   1310736    1   1310736     1 304250 int[]
>
> Source: logs/userlogs/attempt_200905111521_0002_m_000001_0/profile.out (Mapper)
> rank   self  accum     bytes objs     bytes  objs trace name
>     1 77.67% 77.67%  99614736    1  99614736     1 304245 byte[]
>     2 10.66% 88.33%  13676528 854783 2037966768 127372923 304840 java.lang.Integer
>     3  5.58% 93.91%   7158048 447378 359948080 22496755 305451 java.lang.Integer
>     4  3.07% 96.98%   3932176    1   3932176     1 304274 int[]
>     5  1.02% 98.00%   1310736    1   1310736     1 304272 int[]
>
>
> Source: logs/userlogs/attempt_200905111521_0002_m_000002_0/profile.out (Mapper)
> rank   self  accum     bytes objs     bytes  objs trace name
>     1 10.16% 10.16%    253112 1594   1140784  6850 300008 char[]
>     2  9.07% 19.23%    225936   64    946288   266 300184 byte[]
>     3  9.06% 28.29%    225816   64    895128   232 300781 byte[]
>     4  2.63% 30.92%     65552    1     65552     1 302380 byte[]
>     5  1.97% 32.89%     49048  130    252256   700 300056 byte[]
>     6  1.51% 34.39%     37528  260    186896  1229 300086 char[]
>
>
> Source: logs/userlogs/attempt_200905111521_0002_r_000000_0/profile.out  (Reducer)
>  rank   self  accum     bytes objs     bytes  objs trace name
>     1 12.29% 12.29%    677088 42318 1811526016 113220376 306902 java.lang.Integer
>     2 12.25% 24.53%    674816 42176 108428384 6776774 307108 java.lang.Integer
>     3 11.52% 36.05%    634696  102   3574600 10233 300008 char[]
>     4 10.64% 46.69%    586128 24422   1804296 75179 306879 java.util.HashMap$Entry
>     5  7.09% 53.78%    390752 24422   4535616 283476 306878 java.lang.Double
>     6  7.06% 60.84%    389248 24328   4519120 282445 306880 java.lang.Integer
>     7  3.96% 64.80%    218224   74    359448  2939 303276 byte[]
>
>
>
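One hypothesis for all the java.lang.Integer, java.lang.Double and 
HashMap$Entry churn above: if the sparse vector stores its elements in 
a java.util.Map<Integer, Double> (I have not re-checked the 0.1 source, 
so treat this as a guess), then every element access boxes the index 
and the value, roughly like this toy version:

import java.util.HashMap;
import java.util.Map;

// Toy map-backed sparse vector -- NOT the Mahout SparseVector source, just an
// illustration of where boxed Integer/Double and HashMap$Entry allocations come
// from when a distance measure walks two such vectors element by element.
final class MapBackedSparseVector {
  private final int cardinality;                       // e.g. 50,000 terms in the index
  private final Map<Integer, Double> values = new HashMap<Integer, Double>();

  MapBackedSparseVector(int cardinality) {
    this.cardinality = cardinality;
  }

  void set(int index, double value) {
    values.put(index, value);                          // boxes an Integer key and a Double value
  }

  double get(int index) {
    Double v = values.get(index);                      // boxes the index again just to look it up
    return v == null ? 0.0 : v.doubleValue();
  }
}

Do that across points x canopies distance computations and boxed 
Integers will dominate the allocation profile, which would look a lot 
like the mapper traces above.
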
> Source: logs/userlogs/attempt_200905111521_0002_m_000001_0/profile.out  (Mapper)
> rank   self  accum   count trace method
>    1 96.85% 96.85%  347772 304838 java.lang.Object.<init>
>    2  0.34% 97.18%    1203 305459 java.lang.Integer.hashCode
>    3  0.33% 97.51%    1168 304841 java.lang.Integer.hashCode
>
> Source: logs/userlogs/attempt_200905111521_0002_m_000002_0/profile.out (Mapper)
> rank   self  accum   count trace method
>    1  5.59%  5.59%      32 300866 java.lang.ClassLoader.findBootstrapClass
>    2  4.20%  9.79%      24 300859 java.util.zip.ZipFile.read
>    3  3.67% 13.46%      21 301341 java.util.TimeZone.getSystemTimeZoneID
>    4  2.45% 15.91%      14 300119 java.util.zip.ZipFile.open
>    5  2.45% 18.36%      14 301365 java.io.UnixFileSystem.getLength
>    6  2.27% 20.63%      13 300857 java.lang.ClassLoader.defineClass1
>
>
> Source: logs/userlogs/attempt_200905111521_0002_r_000000_0/profile.out  (Reducer)
> rank   self  accum   count trace method
>    1 93.77% 93.77%  236947 304890 java.lang.Object.<init>
>    2  1.46% 95.23%    3693 311379 sun.nio.ch.EPollArrayWrapper.epollWait
>
>
> I also took a heap dump while the mapper was running. 98% of the memory
> was used by the byte arrays allocated/referenced in
> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.
>
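As an aside, that ~95 MB byte[] looks like the map-side sort buffer 
that MapOutputBuffer allocates up front: io.sort.mb defaults to 100 MB, 
and 100 MB minus the default 5% record-index slice is 99,614,720 bytes, 
essentially what the traces show. On a 1 GB box that buffer plus the 
task heap can push you into swap, so it may be worth shrinking it for 
this job, something like the following (the numbers are guesses, not 
recommendations from the Mahout docs; the same keys can be passed as 
-D options if the driver goes through GenericOptionsParser):

import org.apache.hadoop.mapred.JobConf;

// Hypothetical tuning sketch for a memory-constrained single node.
public class SortBufferTuning {
  public static void tune(JobConf conf) {
    conf.setInt("io.sort.mb", 32);                    // down from the 100 MB default
    conf.set("mapred.child.java.opts", "-Xmx256m");   // keep the task JVM heap modest too
  }
}
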
> The document vectors for the input set (of 100 docs) are available here:
> http://docs.google.com/Doc?id=dc5kkrf9_110fqtc63c3
>
> I create canopies with the following command.
>
> $bin/hadoop jar ../mahout-examples-0.1.job
> org.apache.mahout.clustering.canopy.CanopyClusteringJob test100
> output/ org.apache.mahout.utils.EuclideanDistanceMeasure 80 55
>
> The T1, T2 values are the ones that were given for the synthetic data
> example. Should the values of T1 and T2 affect the runtime
> dramatically?
>
> Thanks,
>
> --shashi
>
>
>   

