hbase-user mailing list archives

From Andrew Purtell <apurt...@apache.org>
Subject Re: Re: How to know the root reason to cause RegionServer OOM?
Date Mon, 18 May 2015 16:47:36 GMT
Do not overcommit memory on servers running JVMs for HDFS and HBase (and
YARN and its containers, if you colocate Hadoop MR). Sum the -Xmx parameter,
the maximum heap size, for all JVMs that will run concurrently on the
server. The total should be less than the amount of RAM available on the
server. Additionally, reserve ~1GB for the OS. Finally, set vm.swappiness=0
in /etc/sysctl.conf to prevent unnecessary paging.
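
The check described above can be sketched roughly as follows. This is only an illustration, not an HBase or Hadoop tool: the helper names (parse_xmx, heap_budget), the example flags, and the 32 GiB host size are assumptions made for the example. It also counts heap only. A JVM uses memory beyond -Xmx (thread stacks, off-heap buffers, and so on), so treat the result as an optimistic bound.

```python
import re

# Multipliers for the size suffixes accepted by -Xmx (no suffix means bytes).
_UNITS = {"": 1, "k": 1024, "m": 1024**2, "g": 1024**3}


def parse_xmx(flag):
    """Return the heap limit in bytes for a JVM flag such as '-Xmx16g'."""
    match = re.fullmatch(r"-Xmx(\d+)([kKmMgG]?)", flag)
    if match is None:
        raise ValueError(f"not an -Xmx flag: {flag!r}")
    number, unit = match.groups()
    return int(number) * _UNITS[unit.lower()]


def heap_budget(xmx_flags, total_ram_bytes, os_reserve_bytes=1024**3):
    """Sum the heaps of all colocated JVMs and report the remaining headroom.

    Returns (total_heap_bytes, headroom_bytes). A negative headroom means
    the configured heaps alone already overcommit the machine.
    """
    total_heap = sum(parse_xmx(flag) for flag in xmx_flags)
    headroom = total_ram_bytes - os_reserve_bytes - total_heap
    return total_heap, headroom


if __name__ == "__main__":
    # Hypothetical host: one RegionServer, one DataNode, one NodeManager.
    flags = ["-Xmx16g", "-Xmx1g", "-Xmx512m"]
    total, headroom = heap_budget(flags, total_ram_bytes=32 * 1024**3)
    print(f"total heap: {total / 1024**3:.1f} GiB")
    print(f"headroom:   {headroom / 1024**3:.1f} GiB")
    if headroom < 0:
        print("WARNING: heaps exceed RAM minus the OS reserve")
```

If the headroom comes out negative, or close to zero once non-heap JVM memory is taken into account, the kernel OOM killer can be expected to step in under load.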


On Sun, May 17, 2015 at 8:08 PM, David chen <c77_cn@163.com> wrote:

> The snippet from /var/log/messages is as follows. I am sure that the killed
> process (22827) is the RegionServer.
> ......
> May 14 12:00:38 localhost kernel: Mem-Info:
> May 14 12:00:38 localhost kernel: Node 0 DMA per-cpu:
> May 14 12:00:38 localhost kernel: CPU    0: hi:    0, btch:   1 usd:   0
> ......
> May 14 12:00:38 localhost kernel: CPU   39: hi:    0, btch:   1 usd:   0
> May 14 12:00:38 localhost kernel: Node 0 DMA32 per-cpu:
> May 14 12:00:38 localhost kernel: CPU    0: hi:  186, btch:  31 usd:  30
> ......
> May 14 12:00:38 localhost kernel: CPU   39: hi:  186, btch:  31 usd:   8
> May 14 12:00:38 localhost kernel: Node 0 Normal per-cpu:
> May 14 12:00:38 localhost kernel: CPU    0: hi:  186, btch:  31 usd:   5
> ......
> May 14 12:00:38 localhost kernel: CPU   39: hi:  186, btch:  31 usd:  20
> May 14 12:00:38 localhost kernel: Node 1 Normal per-cpu:
> May 14 12:00:38 localhost kernel: CPU    0: hi:  186, btch:  31 usd:   7
> ......
> May 14 12:00:38 localhost kernel: CPU   39: hi:  186, btch:  31 usd:  10
> May 14 12:00:38 localhost kernel: active_anon:7993118 inactive_anon:48001
> isolated_anon:0
> May 14 12:00:38 localhost kernel: active_file:855 inactive_file:960
> isolated_file:0
> May 14 12:00:38 localhost kernel: unevictable:0 dirty:0 writeback:0
> unstable:0
> May 14 12:00:38 localhost kernel: free:39239 slab_reclaimable:14043
> slab_unreclaimable:27993
> May 14 12:00:38 localhost kernel: mapped:48750 shmem:75053
> pagetables:20540 bounce:0
> May 14 12:00:38 localhost kernel: Node 0 DMA free:15732kB min:40kB
> low:48kB high:60kB active_anon:0kB inactive_anon:0kB active_file:0kB
> inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB
> present:15336kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB
> slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB
> unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0
> all_unreclaimable? yes
> May 14 12:00:38 localhost kernel: lowmem_reserve[]: 0 3211 16088 16088
> May 14 12:00:38 localhost kernel: Node 0 DMA32 free:60388kB min:8968kB
> low:11208kB high:13452kB active_anon:2811676kB inactive_anon:72kB
> active_file:0kB inactive_file:788kB unevictable:0kB isolated(anon):0kB
> isolated(file):0kB present:3288224kB mlocked:0kB dirty:0kB writeback:44kB
> mapped:156kB shmem:8232kB slab_reclaimable:10652kB
> slab_unreclaimable:5144kB kernel_stack:56kB pagetables:4252kB unstable:0kB
> bounce:0kB writeback_tmp:0kB pages_scanned:1312 all_unreclaimable? yes
> May 14 12:00:38 localhost kernel: lowmem_reserve[]: 0 0 12877 12877
> May 14 12:00:38 localhost kernel: Node 0 Normal free:35772kB min:35964kB
> low:44952kB high:53944kB active_anon:13062472kB inactive_anon:4864kB
> active_file:1268kB inactive_file:1504kB unevictable:0kB isolated(anon):0kB
> isolated(file):0kB present:13186560kB mlocked:0kB dirty:0kB writeback:92kB
> mapped:6172kB shmem:51928kB slab_reclaimable:22732kB
> slab_unreclaimable:73204kB kernel_stack:16240kB pagetables:38040kB
> unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:10268
> all_unreclaimable? yes
> May 14 12:00:38 localhost kernel: lowmem_reserve[]: 0 0 0 0
> May 14 12:00:38 localhost kernel: Node 1 Normal free:45064kB min:45132kB
> low:56412kB high:67696kB active_anon:16098324kB inactive_anon:187068kB
> active_file:2192kB inactive_file:1548kB unevictable:0kB isolated(anon):0kB
> isolated(file):0kB present:16547840kB mlocked:0kB dirty:116kB writeback:0kB
> mapped:188672kB shmem:240052kB slab_reclaimable:22788kB
> slab_unreclaimable:33624kB kernel_stack:7352kB pagetables:39868kB
> unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:12064
> all_unreclaimable? yes
> May 14 12:00:38 localhost kernel: lowmem_reserve[]: 0 0 0 0
> May 14 12:00:38 localhost kernel: Node 0 DMA: 1*4kB 0*8kB 1*16kB 1*32kB
> 1*64kB 0*128kB 1*256kB 0*512kB 1*1024kB 1*2048kB 3*4096kB = 15732kB
> May 14 12:00:38 localhost kernel: Node 0 DMA32: 659*4kB 576*8kB 485*16kB
> 338*32kB 208*64kB 106*128kB 27*256kB 2*512kB 0*1024kB 0*2048kB 0*4096kB =
> 60636kB
> May 14 12:00:38 localhost kernel: Node 0 Normal: 1166*4kB 579*8kB 337*16kB
> 203*32kB 106*64kB 61*128kB 3*256kB 2*512kB 0*1024kB 0*2048kB 0*4096kB =
> 37568kB
> May 14 12:00:38 localhost kernel: Node 1 Normal: 668*4kB 405*8kB 422*16kB
> 259*32kB 176*64kB 67*128kB 7*256kB 2*512kB 0*1024kB 0*2048kB 0*4096kB =
> 43608kB
> May 14 12:00:38 localhost kernel: 78257 total pagecache pages
> May 14 12:00:38 localhost kernel: 0 pages in swap cache
> May 14 12:00:38 localhost kernel: Swap cache stats: add 0, delete 0, find
> 0/0
> May 14 12:00:38 localhost kernel: Free swap  = 0kB
> May 14 12:00:38 localhost kernel: Total swap = 0kB
> May 14 12:00:38 localhost kernel: 8388607 pages RAM
> May 14 12:00:38 localhost kernel: 181753 pages reserved
> May 14 12:00:38 localhost kernel: 77957 pages shared
> May 14 12:00:38 localhost kernel: 8104642 pages non-shared
> May 14 12:00:38 localhost kernel: [ pid ]   uid  tgid total_vm      rss
> cpu oom_adj oom_score_adj name
> ......
> May 14 12:00:38 localhost kernel: [22827]   483 22827  4392305  4074129
> 23       0             0 java
> May 14 12:00:38 localhost kernel: [38727]   483 38727   428355    74385
> 22       0             0 java
> ......
> May 14 12:00:38 localhost kernel: Out of memory: Kill process 22827 (java)
> score 497 or sacrifice child
> May 14 12:00:38 localhost kernel: Killed process 22827, UID 483, (java)
> total-vm:17569220kB, anon-rss:16296276kB, file-rss:240kB
> May 14 12:00:38 localhost kernel: sleep invoked oom-killer:
> gfp_mask=0x201da, order=0, oom_adj=0, oom_score_adj=0
> May 14 12:00:38 localhost kernel: sleep cpuset=/ mems_allowed=0-1
> May 14 12:00:38 localhost kernel: Pid: 31136, comm: sleep Not tainted
> 2.6.32-358.el6.x86_64 #1
> May 14 12:00:38 localhost kernel: Call Trace:
> May 14 12:00:38 localhost kernel: [<ffffffff810cb5d1>] ?
> cpuset_print_task_mems_allowed+0x91/0xb0
> May 14 12:00:38 localhost kernel: [<ffffffff8111cd10>] ?
> dump_header+0x90/0x1b0
> May 14 12:00:38 localhost kernel: [<ffffffff810e91ee>] ?
> __delayacct_freepages_end+0x2e/0x30
> May 14 12:00:38 localhost kernel: [<ffffffff8121d0bc>] ?
> security_real_capable_noaudit+0x3c/0x70
> May 14 12:00:38 localhost kernel: [<ffffffff8111d192>] ?
> oom_kill_process+0x82/0x2a0
> May 14 12:00:38 localhost kernel: [<ffffffff8111d0d1>] ?
> select_bad_process+0xe1/0x120
> May 14 12:00:38 localhost kernel: [<ffffffff8111d5d0>] ?
> out_of_memory+0x220/0x3c0
> May 14 12:00:38 localhost kernel: [<ffffffff8112c27c>] ?
> __alloc_pages_nodemask+0x8ac/0x8d0
> May 14 12:00:38 localhost kernel: [<ffffffff8116087a>] ?
> alloc_pages_current+0xaa/0x110
> May 14 12:00:38 localhost kernel: [<ffffffff8111a0f7>] ?
> __page_cache_alloc+0x87/0x90
> May 14 12:00:38 localhost kernel: [<ffffffff81119ade>] ?
> find_get_page+0x1e/0xa0
> May 14 12:00:38 localhost kernel: [<ffffffff8111b0b7>] ?
> filemap_fault+0x1a7/0x500
> May 14 12:00:38 localhost kernel: [<ffffffff811430b4>] ?
> __do_fault+0x54/0x530
> May 14 12:00:38 localhost kernel: [<ffffffff81059784>] ?
> find_busiest_group+0x244/0x9f0
> May 14 12:00:38 localhost kernel: [<ffffffff81143687>] ?
> handle_pte_fault+0xf7/0xb50
> May 14 12:00:38 localhost kernel: [<ffffffff8105e203>] ?
> perf_event_task_sched_out+0x33/0x80
> May 14 12:00:38 localhost kernel: [<ffffffff8114431a>] ?
> handle_mm_fault+0x23a/0x310
> May 14 12:00:38 localhost kernel: [<ffffffff810474c9>] ?
> __do_page_fault+0x139/0x480
> May 14 12:00:38 localhost kernel: [<ffffffff8109be2f>] ?
> hrtimer_try_to_cancel+0x3f/0xd0
> May 14 12:00:38 localhost kernel: [<ffffffff8109bee2>] ?
> hrtimer_cancel+0x22/0x30
> May 14 12:00:38 localhost kernel: [<ffffffff8150f1b3>] ?
> do_nanosleep+0x93/0xc0
> May 14 12:00:38 localhost kernel: [<ffffffff8109bfb4>] ?
> hrtimer_nanosleep+0xc4/0x180
> May 14 12:00:38 localhost kernel: [<ffffffff8109ae00>] ?
> hrtimer_wakeup+0x0/0x30
> May 14 12:00:38 localhost kernel: [<ffffffff8151311e>] ?
> do_page_fault+0x3e/0xa0
> May 14 12:00:38 localhost kernel: [<ffffffff815104d5>] ?
> page_fault+0x25/0x30
> ......
>
> At 2015-05-16 02:39:02, "iain wright" <iainwrig@gmail.com> wrote:
> >What log is this seen in? Can you paste the log line? Do you mean
> >/var/log/messages?
> >On May 12, 2015 7:44 PM, "David chen" <c77_cn@163.com> wrote:
> >
> >> A RegionServer was killed because of OutOfMemory (OOM). The killed process
> >> can be seen in the Linux message log, but I still have the following two
> >> questions:
> >> 1. How can I find the root cause of the OOM?
> >> 2. When a RegionServer encounters OOM, why can't it free some of the memory
> >> it occupies? If it could, the OOM killer would not be needed.
> >> Any ideas would be appreciated!
>



-- 
Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein
(via Tom White)
