mesos-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From GitBox <...@apache.org>
Subject [GitHub] [mesos] cf-natali commented on pull request #388: Fixed a bug where the cgroup task killer leaves the cgroup frozen.
Date Sat, 29 May 2021 19:47:59 GMT

cf-natali commented on pull request #388:
URL: https://github.com/apache/mesos/pull/388#issuecomment-850889402


   > That said, I'm not convinced that postponing c) is the best available option.
   > 
   > Maybe we should consider replacing c) with another termination mechanism that will
be **guaranteed** not to halt this loop in the middle of an iteration? Something like ~setting
a flag by that discard callback and checking this flag~ checking `future.hasDiscard()` inside
this loop before calling `cgroups::freezer:freeze()` ?
   
   Yes, I wondered about this, and actually I initially wanted to go for this approach to
never interrupt of the middle of an iteration, however I chose the approach in this MR because:
   - it's simpler
   - I don't feel too comfortable changing this code as it's already the result of several
workarounds, so as you said earlier it's not clear that any change won't break some subtle
corner-case
   - but more fundamentally, the reason is that if the freezing fails, then I think pretty
much all bets are off. For example in case of timeout upon freeze, I'm not convinced that
the call to `TaskKiller::kill` is safe because by definition some tasks are not frozen hence
the kill won't be atomic and it might be possible that because of PID reuse we end up killing
the wrong process. Another thing is that if freezing failed, I am not sure how reliable the
thawing or if the kernel might just discard it which will also leave the cgroup in a bad state.
   
   However I'm open to trying it if you prefer the non-interruption approach.
   
   It's so sad that there is no simple and reliable way to do this. AFAICT systemd doesn't
use the freezer cgroup because of its various issues - it actually suffers from PID reuse
race conditions, looks like they are considering using freezer in cgroupsv2: https://github.com/systemd/systemd/issues/13101
   
   And sadly, while PID namespace mentions in the various comments sounds promising, the problem
is that it could definitely break some tasks so it's not like we can just switch to using
it.
    
   > Btw, how are you testing the code handling freeze/kill failures?
   > In my experience, the simplest way to reliably create a process in an uninterruptable
sleep (D state) is to create a file on some FUSE-backed FS (say, sshfs), mmap that file, stop
the program backing the FS with SIGSTOP, and repeatedly read/write the mmap-ed bytes until
the process eventually hangs. That's rather complicated, takes 30 seconds of read/write on
my kernel and is not really usable in tests... I'm wondering if there is a simpler option.
   
   It depends.
   If the goal is really to create a process in uninterruptible sleep, then yes I'm not aware
of any way much simpler - typically I'd use a write to an hard NFS mount, and use e.g. iptables
to drop the packets and cause the process to hang.
   
   However I don't believe it's quite necessary to test the freezer code.
   For this change, I just re-ran some cgroups related tests in a loop because it was enough
to trigger the issue.
   
   For the previous changes I've made to the freezer code - some workarounds for kernel bugs
- I'd generally just use `strace --inject` to either introduce faults or delays, e.g.:
   ```
   cf@thinkpad:~$ strace -ttT --inject=read:delay_enter=1s /bin/cat /etc/fstab >/dev/null

   [...]
   20:34:21.932568 openat(AT_FDCWD, "/etc/fstab", O_RDONLY) = 3 <0.000041>
   [...]
   20:34:21.932936 read(3, "# /etc/fstab: static file system"..., 131072) = 788 <1.000289>
   20:34:22.933402 write(1, "# /etc/fstab: static file system"..., 788) = 788 <0.000037>
   20:34:22.933580 read(3, "", 131072)     = 0 <1.000115>
   20:34:23.933857 munmap(0x7ff03b46b000, 139264) = 0 <0.000074>
   [...]
   ```
   
   This can be used to e.g. cause timeouts while reading/writing to `freezer.state` file,
which is enough to exercise some interesting code paths.
   And I quite like it because it's very generic and can be used for many problems.
   
   For something purely freezer-specific, another way I can think of to trigger freezer errors,
and which can actually probably be done as part of an automated test, is to use nested cgroups:
   ```
   root@thinkpad:~# # create parent cgroup
   root@thinkpad:~# mkdir /sys/fs/cgroup/freezer/parent
   root@thinkpad:~# # create child cgroup
   root@thinkpad:~# mkdir /sys/fs/cgroup/freezer/parent/child
   root@thinkpad:~# # freeze parent - since freezing is recursive, the child will be frozen
   root@thinkpad:~# echo FROZEN > /sys/fs/cgroup/freezer/parent/freezer.state
   root@thinkpad:~# cat /sys/fs/cgroup/freezer/parent/freezer.state 
   FROZEN
   root@thinkpad:~# cat /sys/fs/cgroup/freezer/parent/child/freezer.state 
   FROZEN
   root@thinkpad:~# # attempt to thaw the child
   root@thinkpad:~# echo THAWED > /sys/fs/cgroup/freezer/parent/child/freezer.state 
   root@thinkpad:~# # not working since parent is still frozen
   root@thinkpad:~# cat /sys/fs/cgroup/freezer/parent/child/freezer.state 
   FROZEN
   root@thinkpad:~# cat /sys/fs/cgroup/freezer/parent/freezer.state 
   FROZEN
   root@thinkpad:~# # thaw parent, and child thaws as well
   root@thinkpad:~# echo THAWED > /sys/fs/cgroup/freezer/parent/freezer.state 
   root@thinkpad:~# cat /sys/fs/cgroup/freezer/parent/freezer.state 
   THAWED
   root@thinkpad:~# cat /sys/fs/cgroup/freezer/parent/child/freezer.state 
   THAWED
   root@thinkpad:~# 
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



Mime
View raw message