mesos-reviews mailing list archives

From GitBox <...@apache.org>
Subject [GitHub] [mesos] cf-natali opened a new pull request #362: Keep retrying to remove cgroup on EBUSY.
Date Sat, 02 May 2020 00:53:04 GMT

cf-natali opened a new pull request #362:
URL: https://github.com/apache/mesos/pull/362


   This is a follow-up to MESOS-10107, which introduced retries when
   calling rmdir on a seemingly empty cgroup fails with EBUSY because of
   various kernel bugs.
   At the time, the fix introduced a bounded number of retries, using an
   exponential backoff summing up to slightly over 1s. This was done
   because it was similar to what Docker does, and it worked during testing.
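   For reference, here is a rough standalone sketch of what that bounded
   retry looks like - this is only an illustration, not the actual Mesos
   code (which uses libprocess futures rather than blocking sleeps), and
   the helper name and exact delay schedule are assumptions:
   ```cpp
   // Illustrative sketch only: retry rmdir on EBUSY with an exponential
   // backoff whose sleeps (1ms, 2ms, ..., 512ms) sum to slightly over 1s.
   #include <cerrno>
   #include <chrono>
   #include <string>
   #include <thread>

   #include <unistd.h>  // ::rmdir

   bool removeWithBoundedRetries(const std::string& cgroupPath)
   {
     std::chrono::milliseconds delay{1};

     for (int attempt = 0; ; ++attempt) {
       if (::rmdir(cgroupPath.c_str()) == 0) {
         return true;   // Removed successfully.
       }

       if (errno != EBUSY || attempt == 10) {
         return false;  // Real error, or the ~1s backoff budget is exhausted.
       }

       std::this_thread::sleep_for(delay);
       delay *= 2;      // 1ms, 2ms, 4ms, ... exponential backoff.
     }
   }
   ```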
   However, after 1 month without seeing this error in our cluster at work,
   we finally experienced one case where the 1s timeout wasn't enough.
   It could be because the machine was busy at the time, or some other
   random factor.
   So instead of only retrying for 1s, I think it makes more sense to
   just keep retrying until the top-level container destruction timeout -
   set at 1 minute - kicks in.
   This also avoids having a magic timeout in the cgroup code.
   We just need to ensure that when the destroyer is finalised, it
   discards the future in charge of doing the periodic removal.
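   Here is a minimal sketch of the proposed behaviour, again just an
   illustration under assumptions (a plain stop flag stands in for
   discarding the libprocess future, and the helper name and delay cap are
   made up - the real outer bound is the ~1 minute container destruction
   timeout):
   ```cpp
   // Illustrative sketch only: keep retrying rmdir on EBUSY until it
   // succeeds, a real error occurs, or the caller asks us to stop.
   #include <algorithm>
   #include <atomic>
   #include <cerrno>
   #include <chrono>
   #include <string>
   #include <thread>

   #include <unistd.h>  // ::rmdir

   bool removeUntilStopped(const std::string& cgroupPath,
                           const std::atomic<bool>& stopped)
   {
     std::chrono::milliseconds delay{1};
     const std::chrono::milliseconds maxDelay{100};

     while (!stopped.load()) {
       if (::rmdir(cgroupPath.c_str()) == 0) {
         return true;   // Cgroup finally removed.
       }

       if (errno != EBUSY) {
         return false;  // Real error; don't keep retrying.
       }

       std::this_thread::sleep_for(delay);
       delay = std::min(delay * 2, maxDelay);  // Back off, but never give up.
     }

     // Stopped by the caller, e.g. when the destroyer is finalised and the
     // periodic removal is abandoned.
     return false;
   }
   ```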
   
   
   Here are the logs of the problem we've seen in our cluster, for
   reference - you can see that the cgroup destruction fails even after 1s:
   ```
   May  1 11:52:01 host033 mesos-slave[2256]: I0501 11:52:01.523308 4942 containerizer.cpp:3179] Container 34edf43b-fc1f-4eb4-b70b-de7c3c2a3244 has reached its limit for resource [{"name":"mem","scalar":{"value":16320.0},"type":"SCALAR"}] and will be terminated
   May  1 11:52:01 host033 mesos-slave[2256]: I0501 11:52:01.523384 4942 containerizer.cpp:2623] Destroying container 34edf43b-fc1f-4eb4-b70b-de7c3c2a3244 in RUNNING state
   May  1 11:52:01 host033 mesos-slave[2256]: I0501 11:52:01.523396 4942 containerizer.cpp:3321] Transitioning the state of container 34edf43b-fc1f-4eb4-b70b-de7c3c2a3244 from RUNNING to DESTROYING after 56.8612528682667mins
   May  1 11:52:01 host033 mesos-slave[2256]: I0501 11:52:01.523541 4942 linux_launcher.cpp:564] Asked to destroy container 34edf43b-fc1f-4eb4-b70b-de7c3c2a3244
   May  1 11:52:01 host033 mesos-slave[2256]: I0501 11:52:01.523587 4942 linux_launcher.cpp:606] Destroying cgroup '/sys/fs/cgroup/freezer/mesos/34edf43b-fc1f-4eb4-b70b-de7c3c2a3244'
   May  1 11:52:01 host033 mesos-slave[2256]: I0501 11:52:01.523954 4942 cgroups.cpp:2887] Freezing cgroup /sys/fs/cgroup/freezer/mesos/34edf43b-fc1f-4eb4-b70b-de7c3c2a3244
   May  1 11:52:01 host033 mesos-slave[2256]: I0501 11:52:01.524153 4954 cgroups.cpp:1275] Successfully froze cgroup /sys/fs/cgroup/freezer/mesos/34edf43b-fc1f-4eb4-b70b-de7c3c2a3244 after 189184ns
   May  1 11:52:01 host033 mesos-slave[2256]: I0501 11:52:01.524430 4954 cgroups.cpp:2905] Thawing cgroup /sys/fs/cgroup/freezer/mesos/34edf43b-fc1f-4eb4-b70b-de7c3c2a3244
   May  1 11:52:01 host033 mesos-slave[2256]: I0501 11:52:01.524525 4954 cgroups.cpp:1304] Successfully thawed cgroup /sys/fs/cgroup/freezer/mesos/34edf43b-fc1f-4eb4-b70b-de7c3c2a3244 after 87808ns
   May  1 11:52:01 host033 mesos-slave[2256]: I0501 11:52:01.533007 4928 slave.cpp:6616] Got exited event for executor(1)@172.16.20.99:36313
   May  1 11:52:01 host033 mesos-slave[2256]: I0501 11:52:01.557977 4950 containerizer.cpp:3159] Container 34edf43b-fc1f-4eb4-b70b-de7c3c2a3244 has exited
   May  1 11:52:02 host033 mesos-slave[2256]: E0501 11:52:02.583150 4956 slave.cpp:6994] Termination of executor 'secure_executor:13954144-cbcd-6bd4-8e37-af2301ec510d' of framework c0c4ce82-5cff-4116-aacb-c3fd6a93d61b-0000 failed: Failed to kill all processes in the container: Failed to remove cgroup 'mesos/34edf43b-fc1f-4eb4-b70b-de7c3c2a3244': Failed to remove cgroup '/sys/fs/cgroup/freezer/mesos/34edf43b-fc1f-4eb4-b70b-de7c3c2a3244': Device or resource busy
   May  1 11:52:02 host033 mesos-slave[2256]: I0501 11:52:02.583232 4956 slave.cpp:5890] Handling status update TASK_FAILED (Status UUID: cc2899ab-c534-4eeb-a1a4-f28102fc3ca4) for task 13954144-cbcd-6bd4-8e37-af2301ec510d of framework c0c4ce82-5cff-4116-aacb-c3fd6a93d61b-0000 from @0.0.0.0:0
   ```
   
   @abudnik @qianzhangxa 




