vmstat 2 for 2 mins below. Looks like everything is idle (github gist paste if it's easier to read: http://gist.github.com/532512):

procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd     free   buff   cache   si   so    bi    bo   in   cs us sy id wa
 0  0      0 15097116 248428 1398444    0    0     0    50    5   24  0  0 100  0
 0  0      0 15096948 248428 1398444    0    0     0     0  281  281  0  0 100  0
 0  0      0 15096948 248428 1398444    0    0     0     0  279  260  0  0 100  0
 0  0      0 15096948 248428 1398444    0    0     0     0  199  216  0  0 100  0
 4  0      0 15096612 248428 1398448    0    0     0     0  528  467  0  0 100  0
 0  0      0 15096612 248428 1398448    0    0     0     0  208  213  0  0 100  0
 4  0      0 15096460 248428 1398448    0    0     0     0  251  261  0  0 100  0
 0  0      0 15096460 248428 1398448    0    0     0    12  242  248  0  0 100  0
 0  0      0 15096460 248428 1398448    0    0     0    34  228  230  0  0 100  0
 0  0      0 15096476 248428 1398448    0    0     0     0  266  272  0  0 100  0
 0  0      0 15096324 248428 1398448    0    0     0    10  179  206  0  0 100  0
 1  0      0 15096340 248428 1398448    0    0     0     0  225  254  0  0 100  0
 1  0      0 15096188 248428 1398448    0    0     0     0  263  245  0  0 100  0
 0  0      0 15096188 248428 1398448    0    0     0     2  169  210  0  0 100  0
 0  0      0 15096188 248428 1398448    0    0     0     0  201  238  0  0 100  0
 0  0      0 15096036 248428 1398448    0    0     0    10  174  202  0  0 100  0
 0  0      0 15096036 248428 1398448    0    0     0     6  207  222  0  0 100  0
 0  0      0 15095884 248428 1398448    0    0     0     0  198  242  0  0 100  0
 2  0      0 15095884 248428 1398448    0    0     0     0  177  215  0  0 100  0
 0  0      0 15095884 248428 1398448    0    0     0     0  244  265  0  0 100  0
 0  0      0 15095732 248428 1398448    0    0     0     4  197  222  0  0 100  0
 6  0      0 15095732 248428 1398448    0    0     0     6  267  260  0  0 100  0
 0  0      0 15095732 248428 1398448    0    0     0     0  240  239  0  0 100  0
 0  0      0 15095580 248428 1398448    0    0     0     8  180  210  0  0 100  0
 5  0      0 15095580 248428 1398448    0    0     0     0  193  224  0  0 100  0
 1  0      0 15095580 248428 1398448    0    0     0    36  161  191  0  0 100  0
 0  0      0 15095428 248428 1398448    0    0     0     0  176  216  0  0 100  0
 4  0      0 15095428 248428 1398448    0    0     0     0  202  236  0  0 100  0
 0  0      0 15095428 248428 1398448    0    0     0     6  191  220  0  0 100  0
 1  0      0 15095428 248428 1398448    0    0     0     0  188  238  0  0 100  0
 2  0      0 15095276 248428 1398448    0    0     0     0  174  206  0  0 100  0
 1  0      0 15095276 248428 1398448    0    0     0     0  225  249  0  0 100  0
 0  0      0 15095124 248428 1398448    0    0     0     0  222  263  0  0 100  0
 1  0      0 15095124 248428 1398448    0    0     0     6  187  236  0  0 100  0
 5  0      0 15094940 248428 1398448    0    0     0     0  453  434  0  0 100  0
 4  0      0 15094788 248428 1398448    0    0     0     2  227  225  0  0 100  0
 0  0      0 15094788 248428 1398448    0    0     0     0  213  236  0  0 100  0
 4  0      0 15094788 248428 1398448    0    0     0     6  257  253  0  0 100  0
 0  0      0 15094636 248428 1398448    0    0     0     0  215  230  0  0 100  0
 0  0      0 15094652 248428 1398448    0    0     0     0  259  285  0  0 100  0
 0  0      0 15094500 248428 1398448    0    0     0    14  194  219  0  0 100  0
 0  0      0 15094516 248428 1398448    0    0     0     0  227  257  0  0 100  0
 4  0      0 15094516 248428 1398448    0    0     0    36  266  263  0  0 100  0
 1  0      0 15094516 248428 1398448    0    0     0     0  202  213  0  0 100  0
 0  0      0 15094364 248428 1398448    0    0     0     0  204  240  0  0 100  0
 0  0      0 15094212 248428 1398448    0    0     0     6  161  194  0  0 100  0
 0  0      0 15094212 248428 1398448    0    0     0     0  191  215  0  0 100  0
 1  0      0 15094212 248428 1398448    0    0     0     0  216  238  0  0 100  0
 5  0      0 15094212 248428 1398448    0    0     0     6  169  202  0  0 100  0
 0  0      0 15094060 248428 1398448    0    0     0     0  172  216  0  0 100  0
 2  0      0 15094060 248428 1398448    0    0     0     6  201  196  0  0 100  0
 1  0      0 15094060 248428 1398448    0    0     0     0  196  218  0  0 100  0
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd     free   buff   cache   si   so    bi    bo   in   cs us sy id wa
 0  0      0 15093908 248428 1398448    0    0     0     0  206  236  0  0 100  0
 4  0      0 15093908 248428 1398448    0    0     0     0  197  219  0  0 100  0
 0  0      0 15093908 248428 1398448    0    0     0     0  186  227  0  0 100  0
 0  0      0 15093756 248428 1398448    0    0     0     0  168  182  0  0 100  0
 0  0      0 15093756 248428 1398448    0    0     0     0  206  239  0  0 100  0
 0  0      0 15093604 248428 1398448    0    0     0     6  281  248  0  0 100  0
 0  0      0 15093452 248428 1398448    0    0     0     0  185  198  0  0 100  0
 5  0      0 15093452 248428 1398448    0    0     0     0  265  253  0  0 100  0
 0  0      0 15093300 248428 1398448    0    0     0    36  194  211  0  0 100  0
 0  0      0 15093300 248428 1398448    0    0     0     0  228  242  0  0 100  0
 0  0      0 15093300 248428 1398448    0    0     0     0  290  262  0  0 100  0
 0  0      0 15093300 248428 1398448    0    0     0     6  187  207  0  0 100  0

On Tue,
Aug 17, 2010 at 6:54 PM, Ryan Rawson wrote:
> what does vmstat say? Run it like 'vmstat 2' for a minute or two and
> paste the results.
>
> With no cpu being consumed by java, it seems like there must be
> another hidden variable here. Some zombied process perhaps. Or some
> kind of super IO wait or something else.
>
> Since you are running in a hypervisor environment, I can't really say
> what is happening to your instance, although one would think the LA
> numbers would be unaffected by outside processes.
>
> On Tue, Aug 17, 2010 at 3:49 PM, George P. Stathis wrote:
>> Actually, there is nothing in %wa but a ton sitting in %id. This is
>> from the Master:
>>
>> top - 18:30:24 up 5 days, 20:10,  1 user,  load average: 2.55, 1.99, 1.25
>> Tasks:  89 total,   1 running,  88 sleeping,   0 stopped,   0 zombie
>> Cpu(s):  0.0%us,  0.0%sy,  0.0%ni, 99.8%id,  0.0%wa,  0.0%hi,  0.0%si,  0.2%st
>> Mem:  17920228k total,  2795464k used, 15124764k free,   248428k buffers
>> Swap:        0k total,        0k used,        0k free,  1398388k cached
>>
>> I have atop installed, which is reporting the hadoop/hbase java daemons
>> as the most active processes (barely taking any CPU time though, and
>> most of the time in sleep mode):
>>
>> ATOP - domU-12-31-39-18-1 2010/08/17  18:31:46               10 seconds elapsed
>> PRC | sys   0.01s | user   0.00s | #proc     89 | #zombie    0 | #exit      0 |
>> CPU | sys      0% | user      0% | irq       0% | idle    200% | wait      0% |
>> cpu | sys      0% | user      0% | irq       0% | idle    100% | cpu000 w  0% |
>> CPL | avg1   2.55 | avg5    2.12 | avg15   1.35 | csw     2397 | intr    2034 |
>> MEM | tot   17.1G | free   14.4G | cache   1.3G | buff  242.6M | slab  193.1M |
>> SWP | tot    0.0M | free    0.0M |              | vmcom   1.6G | vmlim   8.5G |
>> NET | transport   | tcpi     330 | tcpo     169 | udpi     566 | udpo     147 |
>> NET | network     | ipi      896 | ipo      316 | ipfrw      0 | deliv    896 |
>> NET | eth0   ---- | pcki     777 | pcko     197 | si  248 Kbps | so   70 Kbps |
>> NET | lo     ---- | pcki     119 | pcko     119 | si    9 Kbps | so    9 Kbps |
>>
>>   PID  CPU COMMAND-LINE                                                  1/1
>> 17613   0% atop
>> 17150   0% /usr/lib/jvm/java-6-sun/bin/java -Xmx2048m -XX:+HeapDumpOnOutOfMemor
>> 16527   0% /usr/lib/jvm/java-6-sun/bin/java -Xmx2048m -server -Dcom.sun.managem
>> 16839   0% /usr/lib/jvm/java-6-sun/bin/java -Xmx2048m -server -Dcom.sun.managem
>> 16735   0% /usr/lib/jvm/java-6-sun/bin/java -Xmx2048m -server -Dcom.sun.managem
>> 17083   0% /usr/lib/jvm/java-6-sun/bin/java -Xmx2048m -XX:+HeapDumpOnOutOfMemor
>>
>> Same with htop:
>>
>>   PID USER     PRI  NI  VIRT   RES   SHR S CPU% MEM%   TIME+  Command
>> 16527 ubuntu    20   0 2352M   98M 10336 S  0.0  0.6  0:42.05
>> /usr/lib/jvm/java-6-sun/bin/java -Xmx2048m -server
>> -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote
>> -Dhadoop.log.dir=/var/log/h
>> 16735 ubuntu    20   0 2403M 81544 10236 S  0.0  0.5  0:01.56
>> /usr/lib/jvm/java-6-sun/bin/java -Xmx2048m -server
>> -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote
>> -Dhadoop.log.dir=/var/log/h
>> 17083 ubuntu    20   0 4557M 45388 10912 S  0.0  0.3  0:00.65
>> /usr/lib/jvm/java-6-sun/bin/java -Xmx2048m
>> -XX:+HeapDumpOnOutOfMemoryError -XX:+UseConcMarkSweepGC
>> -XX:+CMSIncrementalMode -server -XX:+Heap
>>     1 root      20   0 23684  1880  1272 S  0.0  0.0  0:00.23 /sbin/init
>>   587 root      20   0  247M  4088  2432 S  0.0  0.0 -596523h-14:-8
>> /usr/sbin/console-kit-daemon --no-daemon
>>  3336 root      20   0 49256  1092   540 S  0.0  0.0  0:00.36 /usr/sbin/sshd
>> 16430 nobody    20   0 34408  3704  1060 S  0.0  0.0  0:00.01 gmond
>> 17150 ubuntu    20   0 2519M  112M 11312 S  0.0  0.6 -596523h-14:-8
>> /usr/lib/jvm/java-6-sun/bin/java -Xmx2048m
>> -XX:+HeapDumpOnOutOfMemoryError -XX:+UseConcMarkSweepGC
>> -XX:+CMSIncrementalMode -server -XX
>>
>> So I'm a bit perplexed.
>> Are there any hadoop/hbase-specific tricks
>> that can reveal what these processes are doing?
>>
>> -GS
>>
>> On Tue, Aug 17, 2010 at 6:14 PM, Jean-Daniel Cryans wrote:
>>> It's not normal, but then again I don't have access to your machines,
>>> so I can only speculate.
>>>
>>> Does "top" show you which process is in %wa? If so, and it's a java
>>> process, can you figure out what's going on in there?
>>>
>>> J-D
>>>
>>> On Tue, Aug 17, 2010 at 11:03 AM, George Stathis wrote:
>>> > Hello,
>>> >
>>> > We have just set up a new cluster on EC2 using Hadoop 0.20.2 and HBase
>>> > 0.20.3. Our small setup as of right now consists of one master and four
>>> > slaves with a replication factor of 2:
>>> >
>>> > Master: xLarge instance with 2 CPUs and 17.5 GB RAM - runs 1 namenode,
>>> > 1 secondarynamenode, 1 jobtracker, 1 hbasemaster, 1 zookeeper (uses its
>>> > own dedicated EBS drive)
>>> > Slaves: xLarge instance with 2 CPUs and 17.5 GB RAM each - run 1 datanode,
>>> > 1 tasktracker, 1 regionserver
>>> >
>>> > We have also installed Ganglia to monitor the cluster stats, as we are
>>> > about to run some performance tests. But right out of the box, we are
>>> > noticing high system loads (especially on the master node) without any
>>> > activity happening on the cluster. Of course, the CPUs are not being
>>> > utilized at all, but Ganglia is reporting almost all nodes in the red,
>>> > as the 1-, 5- and 15-minute load averages are above 100% most of the
>>> > time (i.e. there are more than two processes at a time competing for
>>> > the two CPUs' time).
>>> >
>>> > Question 1: is this normal?
>>> > Question 2: does it matter, since each process barely uses any CPU time?
>>> >
>>> > Thank you in advance, and pardon the noob questions.
>>> >
>>> > -GS
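Ryan's "hidden variable" is often a task stuck in uninterruptible (D) sleep: Linux counts D-state tasks in the load average even though they burn no CPU, which would produce exactly this picture of a 2.55 load on a box that is 100% idle (zombies, by contrast, do not count toward the load but would confirm a parent not reaping its children). A quick, hedged check using only stock ps/awk:

```shell
# Tally processes by state. "D" (uninterruptible sleep) is the state that
# inflates the Linux load average without consuming any CPU; "Z" is zombie.
ps -eo state= | sort | uniq -c

# List just the suspect processes, if there are any.
ps -eo state=,pid=,comm= | awk '$1 ~ /^[DZ]/'
```

Running this every few seconds during a load spike shows whether the load is coming from D-state tasks (often a sign of stuck I/O, NFS, or hypervisor trouble) rather than from anything runnable.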
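On the question of what the Hadoop/HBase daemons are actually doing: the JDK's own jps and jstack tools are the usual way to look inside an idle-seeming JVM. A minimal sketch, assuming the Sun JDK shown in the listings above is on the PATH (the `command -v` guard makes it a harmless no-op elsewhere):

```shell
# Thread-dump every JVM on the box. jps lists JVM pids; jstack prints each
# thread's stack, showing whether threads are RUNNABLE, BLOCKED, or WAITING,
# which is far more informative than top for a mostly-sleeping daemon.
if command -v jstack >/dev/null 2>&1 && command -v jps >/dev/null 2>&1; then
  for pid in $(jps -q); do
    jstack "$pid" > "/tmp/threads-$pid.txt" 2>/dev/null || true
  done
fi
```

If jstack is not installed, sending the JVM SIGQUIT (`kill -QUIT <pid>`) makes it print the same thread dump to its stdout log, e.g. the .out files under the hadoop/hbase log directories. Taking a few dumps seconds apart and diffing them shows which threads, if any, are making progress.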