Thank you to JD as well for the first response!
On Tue, Aug 17, 2010 at 8:59 PM, Ryan Rawson <ryanobjc@gmail.com> wrote:
> You're welcome. Glad you figured it out! Very strange, and it totally
> makes sense that this is a kernel/OS bug.
>
> -ryan
>
> On Tue, Aug 17, 2010 at 5:28 PM, George P. Stathis <gstathis@traackr.com> wrote:
>> OK, high load averages while idle is a confirmed Lucid issue:
>> https://bugs.launchpad.net/ubuntu-on-ec2/+bug/574910
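>>
>> For anyone else hitting this: a quick way to tell whether it's just the
>> load-average accounting rather than real runnable work is to look for
>> processes actually in the R or D state while the load sits high. Rough,
>> untested sketch:
>>
>>   uname -r                                # kernel version, to compare against the bug report
>>   cat /proc/loadavg                       # 1/5/15-minute load averages
>>   ps -eo state,pid,cmd | egrep '^[RD]'    # anything runnable (R) or in uninterruptible sleep (D)?
>>
>> If that third command comes back basically empty while the load average
>> stays high, the load is accounting noise rather than actual work.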
>>
>> Will downgrade back to 9.10 and try again. Thanks for helping out,
>> Ryan. If you hadn't confirmed that this is not normal, I would have
>> wasted a lot more time looking in the wrong places. Now I just have to
>> rebuild us a new cluster...
>>
>> -GS
>>
>> On Tue, Aug 17, 2010 at 7:39 PM, George P. Stathis <gstathis@traackr.com> wrote:
>>> Yeap. Load average definitely goes back down to 0.10 or below once
>>> hadoop/hbase is shut off. Same issue is happening on the slaves as
>>> well BTW. We are running on Ubuntu 10.04, which is the only recent
>>> change other than moving to a cluster setup. We were running a
>>> pseudo-distributed setup before on one machine running Ubuntu 9.10.
>>> Top showed just as much activity under %id, but the load average was
>>> close to zero when idle! That setup was hypervisor'ed as well, so I suspect
>>> that's not really the issue...:-) Know anyone running on an Ubuntu
>>> 10.04 setup?
>>>
>>> -GS
>>>
>>> On Tue, Aug 17, 2010 at 7:08 PM, Ryan Rawson <ryanobjc@gmail.com> wrote:
>>>> This is odd: the run queue has quite a few processes in it, but the
>>>> actual cpu data says 'nothing's happening'.
>>>>
>>>> Can't tell which processes are in the run queue here. Does the load
>>>> average go down when you take hadoop/hbase down?
>>>>
>>>> If so, that seems to identify the locus of the problem, but doesn't
>>>> really explain WHY this is so. Considering I've never really seen this
>>>> before, I'd just write it off to using a hypervisor'ed environment and
>>>> blame other people for stealing all your cpus :-)
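>>>>
>>>> If you want to test that quickly, something along these lines should do
>>>> it -- untested sketch, assuming the usual HADOOP_HOME/HBASE_HOME layout:
>>>>
>>>>   $HBASE_HOME/bin/stop-hbase.sh    # stops the hbase master and regionservers
>>>>   $HADOOP_HOME/bin/stop-all.sh     # stops namenode, datanodes, jobtracker, tasktrackers
>>>>   watch -n 5 cat /proc/loadavg     # see whether the 1/5/15-minute numbers drain back down
>>>>
>>>> then bin/start-all.sh and bin/start-hbase.sh to bring everything back.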
>>>>
>>>> On Tue, Aug 17, 2010 at 4:03 PM, George P. Stathis <gstathis@traackr.com> wrote:
>>>>> vmstat 2 for 2 mins below. Looks like everything is idle (github
>>>>> gist paste if it's easier to read: http://gist.github.com/532512):
>>>>>
>>>>>
>>>>> procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
>>>>>  r  b   swpd     free   buff   cache   si   so    bi   bo   in   cs us sy id wa
>>>>>  0  0      0 15097116 248428 1398444    0    0     0   50    5   24  0  0 100  0
>>>>>  0  0      0 15096948 248428 1398444    0    0     0    0  281  281  0  0 100  0
>>>>>  0  0      0 15096948 248428 1398444    0    0     0    0  279  260  0  0 100  0
>>>>>  0  0      0 15096948 248428 1398444    0    0     0    0  199  216  0  0 100  0
>>>>>  4  0      0 15096612 248428 1398448    0    0     0    0  528  467  0  0 100  0
>>>>>  0  0      0 15096612 248428 1398448    0    0     0    0  208  213  0  0 100  0
>>>>>  4  0      0 15096460 248428 1398448    0    0     0    0  251  261  0  0 100  0
>>>>>  0  0      0 15096460 248428 1398448    0    0     0   12  242  248  0  0 100  0
>>>>>  0  0      0 15096460 248428 1398448    0    0     0   34  228  230  0  0 100  0
>>>>>  0  0      0 15096476 248428 1398448    0    0     0    0  266  272  0  0 100  0
>>>>>  0  0      0 15096324 248428 1398448    0    0     0   10  179  206  0  0 100  0
>>>>>  1  0      0 15096340 248428 1398448    0    0     0    0  225  254  0  0 100  0
>>>>>  1  0      0 15096188 248428 1398448    0    0     0    0  263  245  0  0 100  0
>>>>>  0  0      0 15096188 248428 1398448    0    0     0    2  169  210  0  0 100  0
>>>>>  0  0      0 15096188 248428 1398448    0    0     0    0  201  238  0  0 100  0
>>>>>  0  0      0 15096036 248428 1398448    0    0     0   10  174  202  0  0 100  0
>>>>>  0  0      0 15096036 248428 1398448    0    0     0    6  207  222  0  0 100  0
>>>>>  0  0      0 15095884 248428 1398448    0    0     0    0  198  242  0  0 100  0
>>>>>  2  0      0 15095884 248428 1398448    0    0     0    0  177  215  0  0 100  0
>>>>>  0  0      0 15095884 248428 1398448    0    0     0    0  244  265  0  0 100  0
>>>>>  0  0      0 15095732 248428 1398448    0    0     0    4  197  222  0  0 100  0
>>>>>  6  0      0 15095732 248428 1398448    0    0     0    6  267  260  0  0 100  0
>>>>>  0  0      0 15095732 248428 1398448    0    0     0    0  240  239  0  0 100  0
>>>>>  0  0      0 15095580 248428 1398448    0    0     0    8  180  210  0  0 100  0
>>>>>  5  0      0 15095580 248428 1398448    0    0     0    0  193  224  0  0 100  0
>>>>>  1  0      0 15095580 248428 1398448    0    0     0   36  161  191  0  0 100  0
>>>>>  0  0      0 15095428 248428 1398448    0    0     0    0  176  216  0  0 100  0
>>>>>  4  0      0 15095428 248428 1398448    0    0     0    0  202  236  0  0 100  0
>>>>>  0  0      0 15095428 248428 1398448    0    0     0    6  191  220  0  0 100  0
>>>>>  1  0      0 15095428 248428 1398448    0    0     0    0  188  238  0  0 100  0
>>>>>  2  0      0 15095276 248428 1398448    0    0     0    0  174  206  0  0 100  0
>>>>>  1  0      0 15095276 248428 1398448    0    0     0    0  225  249  0  0 100  0
>>>>>  0  0      0 15095124 248428 1398448    0    0     0    0  222  263  0  0 100  0
>>>>>  1  0      0 15095124 248428 1398448    0    0     0    6  187  236  0  0 100  0
>>>>>  5  0      0 15094940 248428 1398448    0    0     0    0  453  434  0  0 100  0
>>>>>  4  0      0 15094788 248428 1398448    0    0     0    2  227  225  0  0 100  0
>>>>>  0  0      0 15094788 248428 1398448    0    0     0    0  213  236  0  0 100  0
>>>>>  4  0      0 15094788 248428 1398448    0    0     0    6  257  253  0  0 100  0
>>>>>  0  0      0 15094636 248428 1398448    0    0     0    0  215  230  0  0 100  0
>>>>>  0  0      0 15094652 248428 1398448    0    0     0    0  259  285  0  0 100  0
>>>>>  0  0      0 15094500 248428 1398448    0    0     0   14  194  219  0  0 100  0
>>>>>  0  0      0 15094516 248428 1398448    0    0     0    0  227  257  0  0 100  0
>>>>>  4  0      0 15094516 248428 1398448    0    0     0   36  266  263  0  0 100  0
>>>>>  1  0      0 15094516 248428 1398448    0    0     0    0  202  213  0  0 100  0
>>>>>  0  0      0 15094364 248428 1398448    0    0     0    0  204  240  0  0 100  0
>>>>>  0  0      0 15094212 248428 1398448    0    0     0    6  161  194  0  0 100  0
>>>>>  0  0      0 15094212 248428 1398448    0    0     0    0  191  215  0  0 100  0
>>>>>  1  0      0 15094212 248428 1398448    0    0     0    0  216  238  0  0 100  0
>>>>>  5  0      0 15094212 248428 1398448    0    0     0    6  169  202  0  0 100  0
>>>>>  0  0      0 15094060 248428 1398448    0    0     0    0  172  216  0  0 100  0
>>>>>  2  0      0 15094060 248428 1398448    0    0     0    6  201  196  0  0 100  0
>>>>>  1  0      0 15094060 248428 1398448    0    0     0    0  196  218  0  0 100  0
>>>>> procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
>>>>>  r  b   swpd     free   buff   cache   si   so    bi   bo   in   cs us sy id wa
>>>>>  0  0      0 15093908 248428 1398448    0    0     0    0  206  236  0  0 100  0
>>>>>  4  0      0 15093908 248428 1398448    0    0     0    0  197  219  0  0 100  0
>>>>>  0  0      0 15093908 248428 1398448    0    0     0    0  186  227  0  0 100  0
>>>>>  0  0      0 15093756 248428 1398448    0    0     0    0  168  182  0  0 100  0
>>>>>  0  0      0 15093756 248428 1398448    0    0     0    0  206  239  0  0 100  0
>>>>>  0  0      0 15093604 248428 1398448    0    0     0    6  281  248  0  0 100  0
>>>>>  0  0      0 15093452 248428 1398448    0    0     0    0  185  198  0  0 100  0
>>>>>  5  0      0 15093452 248428 1398448    0    0     0    0  265  253  0  0 100  0
>>>>>  0  0      0 15093300 248428 1398448    0    0     0   36  194  211  0  0 100  0
>>>>>  0  0      0 15093300 248428 1398448    0    0     0    0  228  242  0  0 100  0
>>>>>  0  0      0 15093300 248428 1398448    0    0     0    0  290  262  0  0 100  0
>>>>>  0  0      0 15093300 248428 1398448    0    0     0    6  187  207  0  0 100  0
>>>>>
>>>>>
>>>>> On Tue, Aug 17, 2010 at 6:54 PM, Ryan Rawson <ryanobjc@gmail.com> wrote:
>>>>>> what does vmstat say? Run it like 'vmstat 2' for a minute or two and
>>>>>> paste the results.
>>>>>>
>>>>>> With no cpu being consumed by java, it seems like there must be another
>>>>>> hidden variable here. Some zombied process perhaps. Or some kind of
>>>>>> super IO wait or something else.
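>>>>>>
>>>>>> Something like this would surface either case (rough sketch; iostat
>>>>>> needs the sysstat package):
>>>>>>
>>>>>>   ps -eo stat,pid,user,cmd | awk '$1 ~ /^[ZD]/'   # Z = zombie, D = uninterruptible (IO) sleep
>>>>>>   iostat -x 2 5                                   # per-device utilization and wait times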
>>>>>>
>>>>>> Since you are running in a hypervisor environment, I can't really say
>>>>>> what is happening to your instance, although one would think the LA
>>>>>> numbers would be unaffected by outside processes.
>>>>>>
>>>>>> On Tue, Aug 17, 2010 at 3:49 PM, George P. Stathis <gstathis@traackr.com> wrote:
>>>>>>> Actually, there is nothing in %wa but a ton sitting in %id. This is
>>>>>>> from the Master:
>>>>>>>
>>>>>>> top - 18:30:24 up 5 days, 20:10,  1 user,  load average: 2.55, 1.99, 1.25
>>>>>>> Tasks:  89 total,   1 running,  88 sleeping,   0 stopped,   0 zombie
>>>>>>> Cpu(s):  0.0%us,  0.0%sy,  0.0%ni, 99.8%id,  0.0%wa,  0.0%hi,  0.0%si,  0.2%st
>>>>>>> Mem:  17920228k total,  2795464k used, 15124764k free,   248428k buffers
>>>>>>> Swap:        0k total,        0k used,        0k free,  1398388k cached
>>>>>>>
>>>>>>> I have atop installed, which reports the hadoop/hbase java daemons as
>>>>>>> the most active processes (though they barely take any CPU time and
>>>>>>> spend most of it sleeping):
>>>>>>>
>>>>>>> ATOP - domU-12-31-39-18-1    2010/08/17 18:31:46    10 seconds elapsed
>>>>>>> PRC | sys 0.01s | user 0.00s | #proc 89 | #zombie 0 | #exit 0 |
>>>>>>> CPU | sys 0% | user 0% | irq 0% | idle 200% | wait 0% |
>>>>>>> cpu | sys 0% | user 0% | irq 0% | idle 100% | cpu000 w 0% |
>>>>>>> CPL | avg1 2.55 | avg5 2.12 | avg15 1.35 | csw 2397 | intr 2034 |
>>>>>>> MEM | tot 17.1G | free 14.4G | cache 1.3G | buff 242.6M | slab 193.1M |
>>>>>>> SWP | tot 0.0M | free 0.0M | | vmcom 1.6G | vmlim 8.5G |
>>>>>>> NET | transport | tcpi 330 | tcpo 169 | udpi 566 | udpo 147 |
>>>>>>> NET | network | ipi 896 | ipo 316 | ipfrw 0 | deliv 896 |
>>>>>>> NET | eth0 ---- | pcki 777 | pcko 197 | si 248 Kbps | so 70 Kbps |
>>>>>>> NET | lo ---- | pcki 119 | pcko 119 | si 9 Kbps | so 9 Kbps |
>>>>>>>
>>>>>>> PID   CPU  COMMAND-LINE                                              1/1
>>>>>>> 17613  0%  atop
>>>>>>> 17150  0%  /usr/lib/jvm/java-6-sun/bin/java -Xmx2048m -XX:+HeapDumpOnOutOfMemor
>>>>>>> 16527  0%  /usr/lib/jvm/java-6-sun/bin/java -Xmx2048m -server -Dcom.sun.managem
>>>>>>> 16839  0%  /usr/lib/jvm/java-6-sun/bin/java -Xmx2048m -server -Dcom.sun.managem
>>>>>>> 16735  0%  /usr/lib/jvm/java-6-sun/bin/java -Xmx2048m -server -Dcom.sun.managem
>>>>>>> 17083  0%  /usr/lib/jvm/java-6-sun/bin/java -Xmx2048m -XX:+HeapDumpOnOutOfMemor
>>>>>>>
>>>>>>> Same with htop:
>>>>>>>
>>>>>>> PID USER PRI NI VIRT RES SHR S CPU% MEM% TIME+ Command
>>>>>>> 16527 ubuntu 20 0 2352M 98M 10336 S 0.0 0.6 0:42.05 /usr/lib/jvm/java-6-sun/bin/java -Xmx2048m -server -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote -Dhadoop.log.dir=/var/log/h
>>>>>>> 16735 ubuntu 20 0 2403M 81544 10236 S 0.0 0.5 0:01.56 /usr/lib/jvm/java-6-sun/bin/java -Xmx2048m -server -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote -Dhadoop.log.dir=/var/log/h
>>>>>>> 17083 ubuntu 20 0 4557M 45388 10912 S 0.0 0.3 0:00.65 /usr/lib/jvm/java-6-sun/bin/java -Xmx2048m -XX:+HeapDumpOnOutOfMemoryError -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode -server -XX:+Heap
>>>>>>> 1 root 20 0 23684 1880 1272 S 0.0 0.0 0:00.23 /sbin/init
>>>>>>> 587 root 20 0 247M 4088 2432 S 0.0 0.0 -596523h-14:-8 /usr/sbin/console-kit-daemon --no-daemon
>>>>>>> 3336 root 20 0 49256 1092 540 S 0.0 0.0 0:00.36 /usr/sbin/sshd
>>>>>>> 16430 nobody 20 0 34408 3704 1060 S 0.0 0.0 0:00.01 gmond
>>>>>>> 17150 ubuntu 20 0 2519M 112M 11312 S 0.0 0.6 -596523h-14:-8 /usr/lib/jvm/java-6-sun/bin/java -Xmx2048m -XX:+HeapDumpOnOutOfMemoryError -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode -server -XX
>>>>>>>
>>>>>>>
>>>>>>> So I'm a bit perplexed. Are there any hadoop/hbase-specific tricks
>>>>>>> that can reveal what these processes are doing?
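>>>>>>>
>>>>>>> The only generic JVM-level trick I know of is a thread dump, e.g.
>>>>>>> (assuming the full Sun JDK is on the box):
>>>>>>>
>>>>>>>   jps -lv         # map the PIDs to the hadoop/hbase daemons
>>>>>>>   jstack <pid>    # dump thread stacks (or kill -QUIT <pid> to write them to the daemon's .out log)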
>>>>>>>
>>>>>>> -GS
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Aug 17, 2010 at 6:14 PM, Jean-Daniel Cryans <jdcryans@apache.org> wrote:
>>>>>>>>
>>>>>>>> It's not normal, but then again I don't have access to your machines
>>>>>>>> so I can only speculate.
>>>>>>>>
>>>>>>>> Does "top" show you which process is in %wa? If so, and it's a java
>>>>>>>> process, can you figure out what's going on in there?
>>>>>>>>
>>>>>>>> J-D
>>>>>>>>
>>>>>>>> On Tue, Aug 17, 2010 at 11:03 AM, George Stathis <gstathis@gmail.com> wrote:
>>>>>>>> > Hello,
>>>>>>>> >
>>>>>>>> > We have just set up a new cluster on EC2 using Hadoop 0.20.2 and
>>>>>>>> > HBase 0.20.3. Our small setup as of right now consists of one master
>>>>>>>> > and four slaves with a replication factor of 2:
>>>>>>>> >
>>>>>>>> > Master: xLarge instance with 2 CPUs and 17.5 GB RAM - runs 1 namenode,
>>>>>>>> > 1 secondarynamenode, 1 jobtracker, 1 hbasemaster, 1 zookeeper (uses its
>>>>>>>> > own dedicated EBS drive)
>>>>>>>> > Slaves: xLarge instance with 2 CPUs and 17.5 GB RAM each - run 1
>>>>>>>> > datanode, 1 tasktracker, 1 regionserver
>>>>>>>> >
>>>>>>>> > We have also installed Ganglia to monitor the cluster stats, as we are
>>>>>>>> > about to run some performance tests, but right out of the box we are
>>>>>>>> > noticing high system loads (especially on the master node) without any
>>>>>>>> > activity happening on the cluster. Of course, the CPUs are not being
>>>>>>>> > utilized at all, but Ganglia is reporting almost all nodes in the red,
>>>>>>>> > as the 1, 5 and 15 minute load averages are above 100% most of the
>>>>>>>> > time (i.e. there are more than two processes at a time competing for
>>>>>>>> > the two CPUs' time).
>>>>>>>> >
>>>>>>>> > Question 1: is this normal?
>>>>>>>> > Question 2: does it matter, since each process barely uses any CPU
>>>>>>>> > time?
>>>>>>>> >
>>>>>>>> > Thank you in advance and pardon the noob questions.
>>>>>>>> >
>>>>>>>> > -GS
>>>>>>>> >
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
|