mesos-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From GitBox <...@apache.org>
Subject [GitHub] [mesos] cf-natali commented on pull request #388: Fixed a bug where the cgroup task killer leaves the cgroup frozen.
Date Fri, 28 May 2021 18:46:43 GMT

cf-natali commented on pull request #388:
URL: https://github.com/apache/mesos/pull/388#issuecomment-850602132


   > @ridv This, unfortunately, is not a "small patch", but an attempt to compensate for
a deficiency in a workaround in a code that uses a rather problematic kernel interface and
is already full of hacks and short-term solutions.
   
   Indeed, this freezer code is quite tricky, and with good reason - at work we experienced
3 different kernel bugs involving the freezer cgroup, it's really a mess even after all those
years...
   
   > @cf-natali I'm trying to look into this and to make sure this is not making some other
issues worse.
   > 
   > For example, does this all mean that this deferred call that freezes a cgroup
   > 
   > https://github.com/apache/mesos/blob/85981d7c728798783b2e1e7cbed4f27a1497ccb2/src/linux/cgroups.cpp#L1440
   > 
   > has never been executed before this patch (due to immediate termination of a TaskKiller
process), but, after this change, will be executing in some cases?...
   
   As far as I can tell, this should not exercise any new code path, and should actually make
unusual code paths less frequent.
   The code you're referring to is only involved when the freeze operation timed out, and
attempts to thaw and freeze it again - this can happen in normal circumstances, i.e. in the
normal case where a cgroup is destroyed. This specific code path was exercised in the past
- I checked using logs and `strace`, and will continue to be exercised if the freeze times
out.
   
   One way to view this change is that it can only delay the termination of the `TaskKiller`
process: therefore, the `TaskKiller` will only terminate in a state in which it could already
be terminated in in the current code.
   The extra delay is a grace period letting some time to the `TaskKiller` to finish its freeze/kill/thaw
cycle, hence making it much less likely to terminate it while it's in the middle of its freeze/kill/thaw.
   
   > (Btw, inablilty to comment on an untouched line in a github PR is one of examples
why, as of 2019-2020, Mesos committers were still preferring to review contribution using
Apache Reviewboard. I'm not saying that we should not do something to simplify contributing
via GitHub, just illustrating why the RB has not been phased out.)
   
   Interesting, thanks!
   
   > I'm sorry, but reviewing such stuff meaningfully takes significant chunks of an uninterrupted
time and should only be done in a very sound mind and a very good health (this was not my
case in the beginnig of this week).
   
   I'm really sorry to hear that, and hope you're feeling better!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



Mime
View raw message