trafficserver-users mailing list archives

From Bryan Call <bc...@apache.org>
Subject Re: Understanding ATS memory usage
Date Sat, 27 Jan 2018 00:20:33 GMT
I came across this command and it has helped track down some latency issues caused by other processes (ss -s). Can you run it during the time you are seeing latency issues and post the results here?

dstat -c --top-cpu -d --top-bio --top-latency -n
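
If it helps, here is a rough sketch for capturing that together with ss -s on a timestamped loop (assuming GNU date and the dstat plugins are available; the file names are just examples), so the samples can be lined up with the slow test runs:

# sample the socket summary every 10 seconds with a timestamp
while true; do { date -u +%FT%TZ; ss -s; } >> /tmp/ss-summary.log; sleep 10; done &

# same dstat options, logged to CSV at a 10 second interval for later correlation
dstat -c --top-cpu -d --top-bio --top-latency -n --output /tmp/dstat.csv 10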

-Bryan

> On Jan 26, 2018, at 5:48 AM, Veiko Kukk <veiko.kukk@gmail.com> wrote:
> 
> Hi again,
> 
> I'd really appreciate it if somebody could point me in the right direction on how to solve this.
> Whatever ATS does every ~50 minutes has a strong effect on response times.
> ATS is used in reverse proxy mode and we run regular tests against a test URL on the proxied server(s) (excluded from caching in the ATS config).
> This test GET is run as an HAProxy health check every ~15 seconds for two local HAProxy backends, which both pass requests to a single local ATS.
> 
> Quite a complex setup, but the point is that the tests run frequently and give information about ATS response times over a long period.
> 
> Total test runs today: 6364
> Tests that took over 7s today: 50
> 
> Distribution of requests; the first column is the response time, the second is the number of requests under that value.
> 100 1292
> 300 4351
> 500 5194
> 700 5578
> 900 5794
> 1200 5985
> 1400 6058
> 1800 6143
> 
> Here is the output of the test log containing all the extremely slow responses. The test response size is only 609 bytes. Usually the response time fluctuates around
> 
> 2018-01-26T01:13:32.150186+00:00 12412
> 2018-01-26T01:13:32.150188+00:00 20803
> 2018-01-26T02:05:04.536931+00:00 29764
> 2018-01-26T02:05:04.536936+00:00 27271
> 2018-01-26T02:05:04.536941+00:00 10233
> 2018-01-26T02:56:26.968987+00:00 9511
> 2018-01-26T02:56:26.968989+00:00 30084
> 2018-01-26T02:56:26.968991+00:00 27337
> 2018-01-26T04:39:21.947460+00:00 24171
> 2018-01-26T04:39:21.947462+00:00 12042
> 2018-01-26T04:39:21.947464+00:00 36979
> 2018-01-26T04:39:31.954116+00:00 7369
> 2018-01-26T04:39:31.954118+00:00 32305
> 2018-01-26T04:39:31.954120+00:00 19779
> 2018-01-26T04:47:42.349748+00:00 29177
> 2018-01-26T04:47:42.349754+00:00 26212
> 2018-01-26T04:47:42.349757+00:00 21645
> 2018-01-26T04:47:42.349759+00:00 24932
> 2018-01-26T05:39:04.925435+00:00 32361
> 2018-01-26T05:39:04.925438+00:00 33587
> 2018-01-26T05:39:04.925440+00:00 8173
> 2018-01-26T05:39:04.925443+00:00 28149
> 2018-01-26T05:39:04.925445+00:00 29115
> 2018-01-26T06:30:27.643170+00:00 7423
> 2018-01-26T06:30:27.643172+00:00 32271
> 2018-01-26T06:30:27.643174+00:00 18927
> 2018-01-26T06:30:27.643179+00:00 27849
> 2018-01-26T06:30:37.644023+00:00 15160
> 2018-01-26T07:21:50.231681+00:00 19208
> 2018-01-26T07:21:50.231684+00:00 14984
> 2018-01-26T08:13:12.874501+00:00 16876
> 2018-01-26T08:13:22.885389+00:00 14007
> 2018-01-26T09:04:35.509167+00:00 9016
> 2018-01-26T09:04:35.509172+00:00 9356
> 2018-01-26T09:55:58.052277+00:00 24137
> 2018-01-26T09:55:58.052280+00:00 23709
> 2018-01-26T09:55:58.052282+00:00 19901
> 2018-01-26T09:55:58.052284+00:00 19034
> 2018-01-26T10:47:10.614261+00:00 23419
> 2018-01-26T10:47:10.614263+00:00 18967
> 2018-01-26T11:38:32.984318+00:00 14425
> 2018-01-26T11:38:32.984324+00:00 9797
> 2018-01-26T11:38:32.984326+00:00 11161
> 2018-01-26T11:38:32.984329+00:00 16228
> 2018-01-26T12:29:45.511517+00:00 15580
> 2018-01-26T12:29:45.511520+00:00 11439
> 2018-01-26T13:20:58.023816+00:00 21360
> 2018-01-26T13:20:58.023818+00:00 19488
> 2018-01-26T13:20:58.023821+00:00 14737
> 2018-01-26T13:20:58.023823+00:00 17118
> 
> 
> The question is: why does ATS regularly slow down? Are there internal management jobs that use the same single queue as incoming requests, so that requests just wait in the queue until the internal processes finish?
> It's getting worse as ATS uptime increases, triggering HAProxy health check timeouts. After a restart it's not that bad again for a while.
> 
> How can I get rid of this regular slowness?
> 
> Best regards,
> Veiko
> 
> 
> 
> 2018-01-23 13:53 GMT+02:00 Veiko Kukk <veiko.kukk@gmail.com>:
> Hi,
> 
> I should have noted before that, during that timeframe, there is no higher disk activity than on average. No higher load, no disk latency, no CPU load. Nothing abnormal except slow ATS.
> ATS is running on CentOS 7, directly on a dedicated hardware server.
> 
> Dirty page related config that has always been in place on that server:
> 
> vm.dirty_background_ratio = 5
> vm.dirty_ratio = 40
> vm.swappiness = 0
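> 
> A quick way to check whether writeback coincides with the slow periods (just a sketch; the log path is an example) is to sample the kernel's dirty/writeback counters during the stall window:
> 
> # log Dirty/Writeback from /proc/meminfo every 10 seconds with a timestamp
> while true; do { date -u +%FT%TZ; grep -E 'Dirty|Writeback' /proc/meminfo; } >> /tmp/dirty.log; sleep 10; done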
> 
> # free -m
>                total        used        free      shared  buff/cache   available
> Mem:          128831       27841        2436        4331       98554       95722
> Swap:           4095           0        4095
> 
> As you can see, there is a lot of available memory.
> 
> I don't see how writing dirty pages could slow down ATS when there is no indication of excessive load on any of the system resources.
> And this strange regularity: it happens every ~50 minutes, as if some regular (cronjob-like) task is being run inside ATS that delays all other tasks.
> 
> ATS is using a 9 TB raw partition, if that information is relevant.
> 
> Could you point me to the documentation on the dir entry sync periods of ATS?
> 
> 
> -- 
> Veiko
> 
> 
> 
> 2018-01-23 12:12 GMT+02:00 Leif Hedstrom <zwoop@apache.org>:
> 
> 
>> On Jan 23, 2018, at 7:36 PM, Veiko Kukk <veiko.kukk@gmail.com> wrote:
>> 
>> Hi again,
>> 
>> That mysterious task that happens every ~50-51 minutes causes requests/responses to slow down very much, even time out.
>> Requests that usually take a few hundred milliseconds are taking over 30s and timing out. This happens only during the time when ATS memory consumption suddenly drops. It happens both for bypassed URLs and for hits.
>> The ATS version is 7.1.1 and this looks like a serious bug to me.
> 
> 
> That sounds suspiciously like kernel paging activity; maybe it's spending that time flushing dirty pages? Maybe transparent huge pages? Or tweak the sysctls for the dirty page ratios?
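> 
> For example (just a sketch; the exact paths depend on the kernel/distro), to check transparent huge pages and the current writeback settings:
> 
> cat /sys/kernel/mm/transparent_hugepage/enabled
> cat /sys/kernel/mm/transparent_hugepage/defrag
> sysctl vm.dirty_ratio vm.dirty_background_ratio vm.dirty_writeback_centisecs vm.dirty_expire_centisecs
> 
> # temporarily disable THP to rule it out (echo the previous value back to revert)
> echo never > /sys/kernel/mm/transparent_hugepage/enabled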
> 
> The other thing to possibly look at is the dir entry sync periods of ATS. Whenever we sync those to disk, we consume both more memory and more disk I/O, and maybe you are putting too much pressure on the VM (i.e. maybe you need to turn down the RAM cache or tweak the number of directory entries you have).
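> 
> As a sketch, the records.config knobs that would matter here look something like this (example values only; please double-check the exact names and defaults against the 7.1.x documentation before changing anything):
> 
> CONFIG proxy.config.cache.ram_cache.size INT 4G
> CONFIG proxy.config.cache.min_average_object_size INT 32768
> CONFIG proxy.config.cache.dir.sync_frequency INT 60
> 
> With a 9 TB raw cache, raising min_average_object_size reduces the number of directory entries, which shrinks both the memory they use and the amount of work each directory sync has to do.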
> 
> — Leif
> 
> E.g.
> 
> https://lonesysadmin.net/2013/12/22/better-linux-disk-caching-performance-vm-dirty_ratio/
> 
> 
> 
> 

