mahout-user mailing list archives

From Grant Ingersoll <gsing...@apache.org>
Subject Re: Failure to run Clustering example
Date Tue, 12 May 2009 10:56:46 GMT
Is it possible to share the code and the 100 docs?  If not, can you  
reproduce with synthetic data?

-Grant

On May 11, 2009, at 9:38 AM, Shashikant Kore wrote:

> On Wed, May 6, 2009 at 6:45 AM, Grant Ingersoll  
> <gsingers@apache.org> wrote:
>>
>>>
>>> 2. To create canopies for 1000 documents it took almost 75 minutes.
>>> Though the total number of unique terms in the index is 50,000, each
>>> vector has fewer than 100 unique terms (i.e. each document vector is a
>>> sparse vector of cardinality 50,000 with fewer than 100 elements). The
>>> hardware is admittedly "low-end" with 1G RAM and a 1.6GHz dual-core
>>> processor. Hadoop has one node. Values of T1 and T2 were 80 and 55
>>> respectively, as given in the sample program.
>>
>> Have you profiled it?  Would be good to see where the issue is  
>> coming from.
>>
>
> Apologies for the late reply.
>
> I ran clustering on 100 documents with Hadoop's profile flag set to
> true. The Canopy mapper took an hour and the reducer took 32 minutes to
> generate these results. The Canopy Clustering job has yet to finish.
> Here are the relevant outputs.
>
> Source: logs/userlogs/attempt_200905111521_0002_m_000000_0/profile.out (Mapper)
>            percent              live               alloc'ed  stack class
> rank   self  accum     bytes    objs       bytes       objs  trace name
>    1 84.51% 84.51%  99614736       1    99614736          1 304249 byte[]
>    2  5.53% 90.05%   6522848  407678  3336600480  208537530 304697 java.lang.Integer
>    3  3.34% 93.38%   3932176       1     3932176          1 304252 int[]
>    4  3.03% 96.41%   3567216  222951   690373248   43148328 305480 java.lang.Integer
>    5  1.11% 97.52%   1310736       1     1310736          1 304250 int[]
>
> Source: logs/userlogs/attempt_200905111521_0002_m_000001_0/profile.out (Mapper)
>            percent              live               alloc'ed  stack class
> rank   self  accum     bytes    objs       bytes       objs  trace name
>    1 77.67% 77.67%  99614736       1    99614736          1 304245 byte[]
>    2 10.66% 88.33%  13676528  854783  2037966768  127372923 304840 java.lang.Integer
>    3  5.58% 93.91%   7158048  447378   359948080   22496755 305451 java.lang.Integer
>    4  3.07% 96.98%   3932176       1     3932176          1 304274 int[]
>    5  1.02% 98.00%   1310736       1     1310736          1 304272 int[]
>
>
> Source: logs/userlogs/attempt_200905111521_0002_m_000002_0/profile.out (Mapper)
>            percent              live               alloc'ed  stack class
> rank   self  accum     bytes    objs       bytes       objs  trace name
>    1 10.16% 10.16%    253112    1594     1140784       6850 300008 char[]
>    2  9.07% 19.23%    225936      64      946288        266 300184 byte[]
>    3  9.06% 28.29%    225816      64      895128        232 300781 byte[]
>    4  2.63% 30.92%     65552       1       65552          1 302380 byte[]
>    5  1.97% 32.89%     49048     130      252256        700 300056 byte[]
>    6  1.51% 34.39%     37528     260      186896       1229 300086 char[]
>
>
> Source: logs/userlogs/attempt_200905111521_0002_r_000000_0/profile.out (Reducer)
>            percent              live               alloc'ed  stack class
> rank   self  accum     bytes    objs       bytes       objs  trace name
>    1 12.29% 12.29%    677088   42318  1811526016  113220376 306902 java.lang.Integer
>    2 12.25% 24.53%    674816   42176   108428384    6776774 307108 java.lang.Integer
>    3 11.52% 36.05%    634696     102     3574600      10233 300008 char[]
>    4 10.64% 46.69%    586128   24422     1804296      75179 306879 java.util.HashMap$Entry
>    5  7.09% 53.78%    390752   24422     4535616     283476 306878 java.lang.Double
>    6  7.06% 60.84%    389248   24328     4519120     282445 306880 java.lang.Integer
>    7  3.96% 64.80%    218224      74      359448       2939 303276 byte[]
>
>
>
> Source: logs/userlogs/attempt_200905111521_0002_m_000001_0/profile.out (Mapper)
> rank   self  accum   count trace method
>   1 96.85% 96.85%  347772 304838 java.lang.Object.<init>
>   2  0.34% 97.18%    1203 305459 java.lang.Integer.hashCode
>   3  0.33% 97.51%    1168 304841 java.lang.Integer.hashCode
>
> Source: logs/userlogs/attempt_200905111521_0002_m_000002_0/profile.out (Mapper)
> rank   self  accum   count trace method
>   1  5.59%  5.59%      32 300866 java.lang.ClassLoader.findBootstrapClass
>   2  4.20%  9.79%      24 300859 java.util.zip.ZipFile.read
>   3  3.67% 13.46%      21 301341 java.util.TimeZone.getSystemTimeZoneID
>   4  2.45% 15.91%      14 300119 java.util.zip.ZipFile.open
>   5  2.45% 18.36%      14 301365 java.io.UnixFileSystem.getLength
>   6  2.27% 20.63%      13 300857 java.lang.ClassLoader.defineClass1
>
>
> Source: logs/userlogs/attempt_200905111521_0002_r_000000_0/profile.out (Reducer)
> rank   self  accum   count trace method
>   1 93.77% 93.77%  236947 304890 java.lang.Object.<init>
>   2  1.46% 95.23%    3693 311379 sun.nio.ch.EPollArrayWrapper.epollWait
>
>
> I also took a heap dump when Mapper was running. 98% of the memory was
> used by the byte arrays allocated/referenced in
> org.apache.hadoop.mapred.MapTask$MapOutputBuffer
>
> The document vectors for the input set (of 100 docs) are available here.
> http://docs.google.com/Doc?id=dc5kkrf9_110fqtc63c3
>
> I create canopies with the following command.
>
> $bin/hadoop jar ../mahout-examples-0.1.job
> org.apache.mahout.clustering.canopy.CanopyClusteringJob test100
> output/ org.apache.mahout.utils.EuclideanDistanceMeasure 80 55
>
> The t1 and t2 values are the ones given for the synthetic data
> example. Should the values of t1 and t2 affect the runtime
> dramatically?
>
> Thanks,
>
> --shashi
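
On the t1/t2 question quoted above: in the textbook sequential canopy
algorithm the thresholds do change how much work is done, because a
point only leaves the candidate list once it falls within T2 of some
canopy center. Below is a minimal Python sketch of that effect on
synthetic sparse vectors shaped like the ones described (50,000
possible terms, 100 nonzero terms per document). It is not Mahout's
MapReduce implementation; the function names, the dict-based vectors
and the uniform 0-10 term weights are all assumptions made for
illustration.

```python
import math
import random


def synthetic_docs(n_docs, vocab_size=50_000, terms_per_doc=100, seed=0):
    """Build sparse vectors as {term_index: weight} dicts (hypothetical data)."""
    rng = random.Random(seed)
    docs = []
    for _ in range(n_docs):
        terms = rng.sample(range(vocab_size), terms_per_doc)
        docs.append({t: rng.uniform(0.0, 10.0) for t in terms})
    return docs


def euclidean(a, b):
    """Euclidean distance between two sparse dict vectors."""
    keys = a.keys() | b.keys()
    return math.sqrt(sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys))


def canopy(points, t1, t2, distance=euclidean):
    """Textbook sequential canopy clustering.

    Returns (canopies, calls): canopies is a list of lists of point
    indices, calls is the number of distance computations performed.
    """
    if t1 <= t2:
        raise ValueError("t1 must be greater than t2")
    remaining = list(range(len(points)))
    canopies = []
    calls = 0
    while remaining:
        center = remaining.pop(0)
        members = [center]
        survivors = []
        for i in remaining:
            d = distance(points[center], points[i])
            calls += 1
            if d < t1:
                members.append(i)      # loosely bound to this canopy
            if d >= t2:
                survivors.append(i)    # not tightly bound, stays a candidate
        remaining = survivors
        canopies.append(members)
    return canopies, calls


if __name__ == "__main__":
    docs = synthetic_docs(50)
    for t1, t2 in [(80, 55), (300, 200)]:
        found, calls = canopy(docs, t1, t2)
        print(f"t1={t1} t2={t2}: {len(found)} canopies, {calls} distance calls")
```

In this sketch, when T2 is below the typical pairwise distance no point
is ever removed by another center, so every point becomes its own
canopy and the number of distance computations grows as n(n-1)/2. When
T2 exceeds every pairwise distance, a single pass of n-1 computations
is enough. Whether the real document vectors behave the same way
depends on their actual distance distribution, which is why a
reproducible synthetic input would help.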

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:
http://www.lucidimagination.com/search

