mesos-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Alexander Rukletsov <ruklet...@gmail.com>
Subject Re: Review Request 61530: Enabled retries for `killTasks` in docker executor.
Date Thu, 10 Aug 2017 12:07:06 GMT

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/61530/#review182573
-----------------------------------------------------------




src/docker/executor.cpp
Lines 31 (patched)
<https://reviews.apache.org/r/61530/#comment258494>

    We sort includes alphabetically.



src/docker/executor.cpp
Lines 410-415 (original), 416-421 (patched)
<https://reviews.apache.org/r/61530/#comment258498>

    Let's add a comment explaining why are we doing retry / unblock on failure and why not
timeout. Something like:
    ```
    // Invoking `docker stop` might be unsuccessful, in which case the container most probably
does not receive the signal. In this case we should allow schedulers to retry the kill operation
or, if the kill was initiated by a failing health check, retry ourselves. We do not bail out
nor stop retrying to avoid sending a terminal status update while the container might still
be running.
    //
    // NOTE: `docker stop` might also hang. We do not address this for now, because there
is no evidence that in this case docker daemon might funciton properly, i.e., it is only the
docker cli command that hangs, and hence there is not so much we can do.
    ```



src/docker/executor.cpp
Lines 427-428 (patched)
<https://reviews.apache.org/r/61530/#comment258499>

    Please no "magic constants" in the code. Add it to the existing section above. `5s` sounds
reasonable to me.



src/docker/executor.cpp
Lines 604-606 (original), 616-618 (patched)
<https://reviews.apache.org/r/61530/#comment258497>

    We should augment the comment here explaining why we need `killAttemptInProgress` or whatever
we end up having.



src/docker/executor.cpp
Lines 606-608 (original), 618-621 (patched)
<https://reviews.apache.org/r/61530/#comment258496>

    Instead of `killed` flag, which can be rolled back, and `firstKillAttempt`, that kinda
indicates that a kill has been issued before, how about having `killed` and `killAttemptInProgress`
with the following semantics:
    - After task transitions to `killed`, there is no way back. This corresponds to our intent.
This also fixes the race with the health update that might come in after we set `killed =
false` to unblock schedulers.
    - `killAttemptInProgress` will guard kill requests from schedulers and will be adjusted
accordingly.


- Alexander Rukletsov


On Aug. 9, 2017, 4:55 p.m., Andrei Budnik wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/61530/
> -----------------------------------------------------------
> 
> (Updated Aug. 9, 2017, 4:55 p.m.)
> 
> 
> Review request for mesos and Alexander Rukletsov.
> 
> 
> Bugs: MESOS-6743
>     https://issues.apache.org/jira/browse/MESOS-6743
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Previously, after `docker stop` command failure, docker executor
> neither allowed a scheduler to retry `killTask` command, nor retried to
> kill a task when task killing was triggered by a failed health check.
> 
> 
> Diffs
> -----
> 
>   src/docker/executor.cpp 26f12ec002f754fab0d34c01472cf95b499d8007 
> 
> 
> Diff: https://reviews.apache.org/r/61530/diff/1/
> 
> 
> Testing
> -------
> 
> sudo make check
> 
> Manual testing:
> 
> Emulating `docker stop` errors:
> ===============================
> 1. Add `return fmt.Errorf("Emulating error!")` to https://github.com/docker/docker-ce/blob/master/components/engine/daemon/stop.go#L21
> 2. build docker from sources: http://oyvindsk.com/writing/docker-build-from-source
> 3. stop docker service and launch dockerd binary, like: `sudo ./bundles/17.06.0-dev/binary-daemon/dockerd`
> 
> Emulating docker daemon hang:
> =============================
> 1. `ps aux|grep dockerd` - 2 processes will be found
> 2. `sudo kill -STOP <PID1> <PID2>` - send SIGSTOP to docker daemon processes
just before sending `docker stop`
> 
> Emulating health check failure in docker executor:
> ==================================================
> 1. Add
> ```c++
>   static int fake = 0;
>   if (++fake > 10) {
>     failure();
>     return;
>   }
> ```
> to `HealthChecker::processCheckResult()` in `src/checks/health_checker.cpp`
> 2. Add
> ```c++
>        HealthCheck healthCheck;
>        healthCheck.set_type(HealthCheck::COMMAND);
>        healthCheck.mutable_command()->set_value("exit 0");
>        healthCheck.set_delay_seconds(0);
>        healthCheck.set_interval_seconds(0);
>        healthCheck.set_grace_period_seconds(1);
>        _task.mutable_health_check()->CopyFrom(healthCheck);
> ```
> to `CommandScheduler::offers()` in `src/cli/execute.cpp`
> 3. compile mesos
> 4. run mesos agent: `sudo GLOG_v=1 ./bin/mesos-agent.sh --resources="cpus:10000;mem:1000000"
--work_dir='/home/some_user/mesos/build/var/agent-1' --containerizers="docker,mesos" --master="127.0.1.1:5050"`
> 5. launch docker executor: `./src/mesos-execute --master="127.0.1.1:5050" --name="a"
--containerizer=docker --docker_image="ubuntu:xenial" --command="while true; do : ; done"`
> 
> 
> Thanks,
> 
> Andrei Budnik
> 
>


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message