mesos-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Alexander Rukletsov <ruklet...@gmail.com>
Subject Re: Review Request 61530: Enabled retries for `killTasks` in docker executor.
Date Thu, 10 Aug 2017 12:09:00 GMT


> On Aug. 10, 2017, 12:07 p.m., Alexander Rukletsov wrote:
> > src/docker/executor.cpp
> > Lines 410-415 (original), 416-421 (patched)
> > <https://reviews.apache.org/r/61530/diff/1/?file=1794166#file1794166line416>
> >
> >     Let's add a comment explaining why are we doing retry / unblock on failure and
why not timeout. Something like:
> >     ```
> >     // Invoking `docker stop` might be unsuccessful, in which case the container
most probably does not receive the signal. In this case we should allow schedulers to retry
the kill operation or, if the kill was initiated by a failing health check, retry ourselves.
We do not bail out nor stop retrying to avoid sending a terminal status update while the container
might still be running.
> >     //
> >     // NOTE: `docker stop` might also hang. We do not address this for now, because
there is no evidence that in this case docker daemon might funciton properly, i.e., it is
only the docker cli command that hangs, and hence there is not so much we can do.
> >     ```

We should also refer to https://issues.apache.org/jira/browse/MESOS-6743 in the comment so
that folks can get more context.


- Alexander


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/61530/#review182573
-----------------------------------------------------------


On Aug. 9, 2017, 4:55 p.m., Andrei Budnik wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/61530/
> -----------------------------------------------------------
> 
> (Updated Aug. 9, 2017, 4:55 p.m.)
> 
> 
> Review request for mesos and Alexander Rukletsov.
> 
> 
> Bugs: MESOS-6743
>     https://issues.apache.org/jira/browse/MESOS-6743
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Previously, after `docker stop` command failure, docker executor
> neither allowed a scheduler to retry `killTask` command, nor retried to
> kill a task when task killing was triggered by a failed health check.
> 
> 
> Diffs
> -----
> 
>   src/docker/executor.cpp 26f12ec002f754fab0d34c01472cf95b499d8007 
> 
> 
> Diff: https://reviews.apache.org/r/61530/diff/1/
> 
> 
> Testing
> -------
> 
> sudo make check
> 
> Manual testing:
> 
> Emulating `docker stop` errors:
> ===============================
> 1. Add `return fmt.Errorf("Emulating error!")` to https://github.com/docker/docker-ce/blob/master/components/engine/daemon/stop.go#L21
> 2. build docker from sources: http://oyvindsk.com/writing/docker-build-from-source
> 3. stop docker service and launch dockerd binary, like: `sudo ./bundles/17.06.0-dev/binary-daemon/dockerd`
> 
> Emulating docker daemon hang:
> =============================
> 1. `ps aux|grep dockerd` - 2 processes will be found
> 2. `sudo kill -STOP <PID1> <PID2>` - send SIGSTOP to docker daemon processes
just before sending `docker stop`
> 
> Emulating health check failure in docker executor:
> ==================================================
> 1. Add
> ```c++
>   static int fake = 0;
>   if (++fake > 10) {
>     failure();
>     return;
>   }
> ```
> to `HealthChecker::processCheckResult()` in `src/checks/health_checker.cpp`
> 2. Add
> ```c++
>        HealthCheck healthCheck;
>        healthCheck.set_type(HealthCheck::COMMAND);
>        healthCheck.mutable_command()->set_value("exit 0");
>        healthCheck.set_delay_seconds(0);
>        healthCheck.set_interval_seconds(0);
>        healthCheck.set_grace_period_seconds(1);
>        _task.mutable_health_check()->CopyFrom(healthCheck);
> ```
> to `CommandScheduler::offers()` in `src/cli/execute.cpp`
> 3. compile mesos
> 4. run mesos agent: `sudo GLOG_v=1 ./bin/mesos-agent.sh --resources="cpus:10000;mem:1000000"
--work_dir='/home/some_user/mesos/build/var/agent-1' --containerizers="docker,mesos" --master="127.0.1.1:5050"`
> 5. launch docker executor: `./src/mesos-execute --master="127.0.1.1:5050" --name="a"
--containerizer=docker --docker_image="ubuntu:xenial" --command="while true; do : ; done"`
> 
> 
> Thanks,
> 
> Andrei Budnik
> 
>


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message