From: "George P. Stathis" <gstathis@gmail.com>
Date: Tue, 17 Aug 2010 19:03:19 -0400
Subject: Re: High OS Load Numbers when idle
To: user@hbase.apache.org

vmstat 2 for 2 mins below.
Looks like everything is idle (github gist paste if it's easier to read: http://gist.github.com/532512):

procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd     free   buff   cache   si   so    bi    bo   in   cs us sy  id wa
 0  0      0 15097116 248428 1398444    0    0     0    50    5   24  0  0 100  0
 0  0      0 15096948 248428 1398444    0    0     0     0  281  281  0  0 100  0
 0  0      0 15096948 248428 1398444    0    0     0     0  279  260  0  0 100  0
 0  0      0 15096948 248428 1398444    0    0     0     0  199  216  0  0 100  0
 4  0      0 15096612 248428 1398448    0    0     0     0  528  467  0  0 100  0
 0  0      0 15096612 248428 1398448    0    0     0     0  208  213  0  0 100  0
 4  0      0 15096460 248428 1398448    0    0     0     0  251  261  0  0 100  0
 0  0      0 15096460 248428 1398448    0    0     0    12  242  248  0  0 100  0
 0  0      0 15096460 248428 1398448    0    0     0    34  228  230  0  0 100  0
 0  0      0 15096476 248428 1398448    0    0     0     0  266  272  0  0 100  0
 0  0      0 15096324 248428 1398448    0    0     0    10  179  206  0  0 100  0
 1  0      0 15096340 248428 1398448    0    0     0     0  225  254  0  0 100  0
 1  0      0 15096188 248428 1398448    0    0     0     0  263  245  0  0 100  0
 0  0      0 15096188 248428 1398448    0    0     0     2  169  210  0  0 100  0
 0  0      0 15096188 248428 1398448    0    0     0     0  201  238  0  0 100  0
 0  0      0 15096036 248428 1398448    0    0     0    10  174  202  0  0 100  0
 0  0      0 15096036 248428 1398448    0    0     0     6  207  222  0  0 100  0
 0  0      0 15095884 248428 1398448    0    0     0     0  198  242  0  0 100  0
 2  0      0 15095884 248428 1398448    0    0     0     0  177  215  0  0 100  0
 0  0      0 15095884 248428 1398448    0    0     0     0  244  265  0  0 100  0
 0  0      0 15095732 248428 1398448    0    0     0     4  197  222  0  0 100  0
 6  0      0 15095732 248428 1398448    0    0     0     6  267  260  0  0 100  0
 0  0      0 15095732 248428 1398448    0    0     0     0  240  239  0  0 100  0
 0  0      0 15095580 248428 1398448    0    0     0     8  180  210  0  0 100  0
 5  0      0 15095580 248428 1398448    0    0     0     0  193  224  0  0 100  0
 1  0      0 15095580 248428 1398448    0    0     0    36  161  191  0  0 100  0
 0  0      0 15095428 248428 1398448    0    0     0     0  176  216  0  0 100  0
 4  0      0 15095428 248428 1398448    0    0     0     0  202  236  0  0 100  0
 0  0      0 15095428 248428 1398448    0    0     0     6  191  220  0  0 100  0
 1  0      0 15095428 248428 1398448    0    0     0     0  188  238  0  0 100  0
 2  0      0 15095276 248428 1398448    0    0     0     0  174  206  0  0 100  0
 1  0      0 15095276 248428 1398448    0    0     0     0  225  249  0  0 100  0
 0  0      0 15095124 248428 1398448    0    0     0     0  222  263  0  0 100  0
 1  0      0 15095124 248428 1398448    0    0     0     6  187  236  0  0 100  0
 5  0      0 15094940 248428 1398448    0    0     0     0  453  434  0  0 100  0
 4  0      0 15094788 248428 1398448    0    0     0     2  227  225  0  0 100  0
 0  0      0 15094788 248428 1398448    0    0     0     0  213  236  0  0 100  0
 4  0      0 15094788 248428 1398448    0    0     0     6  257  253  0  0 100  0
 0  0      0 15094636 248428 1398448    0    0     0     0  215  230  0  0 100  0
 0  0      0 15094652 248428 1398448    0    0     0     0  259  285  0  0 100  0
 0  0      0 15094500 248428 1398448    0    0     0    14  194  219  0  0 100  0
 0  0      0 15094516 248428 1398448    0    0     0     0  227  257  0  0 100  0
 4  0      0 15094516 248428 1398448    0    0     0    36  266  263  0  0 100  0
 1  0      0 15094516 248428 1398448    0    0     0     0  202  213  0  0 100  0
 0  0      0 15094364 248428 1398448    0    0     0     0  204  240  0  0 100  0
 0  0      0 15094212 248428 1398448    0    0     0     6  161  194  0  0 100  0
 0  0      0 15094212 248428 1398448    0    0     0     0  191  215  0  0 100  0
 1  0      0 15094212 248428 1398448    0    0     0     0  216  238  0  0 100  0
 5  0      0 15094212 248428 1398448    0    0     0     6  169  202  0  0 100  0
 0  0      0 15094060 248428 1398448    0    0     0     0  172  216  0  0 100  0
 2  0      0 15094060 248428 1398448    0    0     0     6  201  196  0  0 100  0
 1  0      0 15094060 248428 1398448    0    0     0     0  196  218  0  0 100  0
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd     free   buff   cache   si   so    bi    bo   in   cs us sy  id wa
 0  0      0 15093908 248428 1398448    0    0     0     0  206  236  0  0 100  0
 4  0      0 15093908 248428 1398448    0    0     0     0  197  219  0  0 100  0
 0  0      0 15093908 248428 1398448    0    0     0     0  186  227  0  0 100  0
 0  0      0 15093756 248428 1398448    0    0     0     0  168  182  0  0 100  0
 0  0      0 15093756 248428 1398448    0    0     0     0  206  239  0  0 100  0
 0  0      0 15093604 248428 1398448    0    0     0     6  281  248  0  0 100  0
 0  0      0 15093452 248428 1398448    0    0     0     0  185  198  0  0 100  0
 5  0      0 15093452 248428 1398448    0    0     0     0  265  253  0  0 100  0
 0  0      0 15093300 248428 1398448    0    0     0    36  194  211  0  0 100  0
 0  0      0 15093300 248428 1398448    0    0     0     0  228  242  0  0 100  0
 0  0      0 15093300 248428 1398448    0    0     0     0  290  262  0  0 100  0
 0  0      0 15093300 248428 1398448    0    0     0     6  187  207  0  0 100  0

On Tue, Aug 17, 2010 at 6:54 PM, Ryan Rawson wrote:
> what does vmstat say?  Run it like 'vmstat 2' for a minute or two and
> paste the results.
>
> With no cpu being consumed by java, it seems like there must be
> another hidden variable here.  Some zombied process perhaps. Or some
> kind of super IO wait or something else.
>
> Since you are running on a hypervisor environment, I can't really say
> what is happening to your instance, although one would think the LA
> numbers would be unaffected by outside processes.
>
> On Tue, Aug 17, 2010 at 3:49 PM, George P. Stathis wrote:
>> Actually, there is nothing in %wa but a ton sitting in %id. This is
>> from the Master:
>>
>> top - 18:30:24 up 5 days, 20:10,  1 user,  load average: 2.55, 1.99, 1.25
>> Tasks:  89 total,   1 running,  88 sleeping,   0 stopped,   0 zombie
>> Cpu(s):  0.0%us,  0.0%sy,  0.0%ni, 99.8%id,  0.0%wa,  0.0%hi,  0.0%si,  0.2%st
>> Mem:  17920228k total,  2795464k used, 15124764k free,   248428k buffers
>> Swap:        0k total,        0k used,        0k free,  1398388k cached
>>
>> I have atop installed, which is reporting the hadoop/hbase java daemons
>> as the most active processes (barely taking any CPU time though, and
>> most of the time in sleep mode):
>>
>> ATOP - domU-12-31-39-18-1  2010/08/17  18:31:46               10 seconds elapsed
>> PRC | sys   0.01s | user   0.00s | #proc     89 | #zombie    0 | #exit      0 |
>> CPU | sys      0% | user      0% | irq       0% | idle    200% | wait      0% |
>> cpu | sys      0% | user      0% | irq       0% | idle    100% | cpu000 w  0% |
>> CPL | avg1   2.55 | avg5    2.12 | avg15   1.35 | csw     2397 | intr    2034 |
>> MEM | tot   17.1G | free   14.4G | cache   1.3G | buff  242.6M | slab  193.1M |
>> SWP | tot    0.0M | free    0.0M |              | vmcom   1.6G | vmlim   8.5G |
>> NET | transport   | tcpi     330 | tcpo     169 | udpi     566 | udpo     147 |
>> NET | network     | ipi      896 | ipo      316 | ipfrw      0 | deliv    896 |
>> NET | eth0   ---- | pcki     777 | pcko     197 | si  248 Kbps | so   70 Kbps |
>> NET | lo     ---- | pcki     119 | pcko     119 | si    9 Kbps | so    9 Kbps |
>>
>>   PID  CPU COMMAND-LINE                                                1/1
>> 17613   0% atop
>> 17150   0% /usr/lib/jvm/java-6-sun/bin/java -Xmx2048m -XX:+HeapDumpOnOutOfMemor
>> 16527   0% /usr/lib/jvm/java-6-sun/bin/java -Xmx2048m -server -Dcom.sun.managem
>> 16839   0% /usr/lib/jvm/java-6-sun/bin/java -Xmx2048m -server -Dcom.sun.managem
>> 16735   0% /usr/lib/jvm/java-6-sun/bin/java -Xmx2048m -server -Dcom.sun.managem
>> 17083   0% /usr/lib/jvm/java-6-sun/bin/java -Xmx2048m -XX:+HeapDumpOnOutOfMemor
>>
>> Same with htop:
>>   PID USER     PRI  NI  VIRT   RES   SHR S CPU% MEM%   TIME+  Command
>> 16527 ubuntu    20   0 2352M   98M 10336 S  0.0  0.6  0:42.05
>> /usr/lib/jvm/java-6-sun/bin/java -Xmx2048m -server
>> -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote
>> -Dhadoop.log.dir=/var/log/h
>> 16735 ubuntu    20   0 2403M 81544 10236 S  0.0  0.5  0:01.56
>> /usr/lib/jvm/java-6-sun/bin/java -Xmx2048m -server
>> -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote
>> -Dhadoop.log.dir=/var/log/h
>> 17083 ubuntu    20   0 4557M 45388 10912 S  0.0  0.3  0:00.65
>> /usr/lib/jvm/java-6-sun/bin/java -Xmx2048m
>> -XX:+HeapDumpOnOutOfMemoryError -XX:+UseConcMarkSweepGC
>> -XX:+CMSIncrementalMode -server -XX:+Heap
>>     1 root      20   0 23684  1880  1272 S  0.0  0.0  0:00.23 /sbin/init
>>   587 root      20   0  247M  4088  2432 S  0.0  0.0 -596523h-14:-8
>> /usr/sbin/console-kit-daemon --no-daemon
>>  3336 root      20   0 49256  1092   540 S  0.0  0.0  0:00.36 /usr/sbin/sshd
>> 16430 nobody    20   0 34408  3704  1060 S  0.0  0.0  0:00.01 gmond
>> 17150 ubuntu    20   0 2519M  112M 11312 S  0.0  0.6 -596523h-14:-8
>> /usr/lib/jvm/java-6-sun/bin/java -Xmx2048m
>> -XX:+HeapDumpOnOutOfMemoryError -XX:+UseConcMarkSweepGC
>> -XX:+CMSIncrementalMode -server -XX
>>
>>
>> So I'm a bit perplexed. Are there any hadoop / hbase specific tricks
>> that can reveal what these processes are doing?
>>
>> -GS
>>
>>
>>
>> On Tue, Aug 17, 2010 at 6:14 PM, Jean-Daniel Cryans wrote:
>>>
>>> It's not normal, but then again I don't have access to your machines
>>> so I can only speculate.
>>>
>>> Does "top" show you which process is in %wa? If so and it's a java
>>> process, can you figure out what's going on in there?
>>>
>>> J-D
>>>
>>> On Tue, Aug 17, 2010 at 11:03 AM, George Stathis wrote:
>>> > Hello,
>>> >
>>> > We have just set up a new cluster on EC2 using Hadoop 0.20.2 and HBase
>>> > 0.20.3. Our small setup as of right now consists of one master and four
>>> > slaves with a replication factor of 2:
>>> >
>>> > Master: xLarge instance with 2 CPUs and 17.5 GB RAM - runs 1 namenode, 1
>>> > secondarynamenode, 1 jobtracker, 1 hbasemaster, 1 zookeeper (uses its own
>>> > dedicated EBS drive)
>>> > Slaves: xLarge instance with 2 CPUs and 17.5 GB RAM each - run 1 datanode, 1
>>> > tasktracker, 1 regionserver
>>> >
>>> > We have also installed Ganglia to monitor the cluster stats as we are about
>>> > to run some performance tests but, right out of the box, we are noticing
>>> > high system loads (especially on the master node) without any activity
>>> > happening on the cluster. Of course, the CPUs are not being utilized at all,
>>> > but Ganglia is reporting almost all nodes in the red, as the 1, 5 and 15
>>> > minute load averages are above 100% most of the time (i.e. there are more
>>> > than two processes at a time competing for the 2 CPUs' time).
>>> >
>>> > Question 1: is this normal?
>>> > Question 2: does it matter since each process barely uses any of the CPU
>>> > time?
>>> >
>>> > Thank you in advance and pardon the noob questions.
>>> >
>>> > -GS
>>> >
>>
>
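
For what it's worth, here is the rough set of commands I'm planning to try next to
see what is actually driving the load while the JVMs sit mostly in sleep mode
(untested as typed; the PIDs are the ones from the atop/htop listings above):

  # anything stuck in uninterruptible sleep (D state) counts toward the Linux
  # load average even though it uses no CPU, so list those first
  ps -eo state,pid,cmd | awk '$1 ~ /^D/'

  # thread stacks for one of the JVMs above (e.g. 17150), to see whether its
  # threads are blocked, waiting or runnable
  /usr/lib/jvm/java-6-sun/bin/jstack 17150

  # per-thread CPU for the same JVM, to match against the stack dump
  top -H -p 17150

  # sanity-check the raw load average against the CPU count; 2.55 on 2 CPUs
  # is what Ganglia renders as >100%
  cat /proc/loadavg
  grep -c ^processor /proc/cpuinfo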