mesos-reviews mailing list archives

From Andrei Budnik <abud...@mesosphere.com>
Subject Re: Review Request 65759: Added inspect retries to the Docker executor.
Date Wed, 28 Feb 2018 16:49:28 GMT


> On Feb. 28, 2018, 10:04 a.m., Greg Mann wrote:
> > src/docker/executor.cpp
> > Lines 242-247 (patched)
> > <https://reviews.apache.org/r/65759/diff/3/?file=1965994#file1965994line242>
> >
> >     Doesn't this render the `onFailed` callback registered on L357 useless? i.e., the `inspect` future will never transition to the failed state?
> >     
> >     If so, we should probably either remove the `onFailed` callback, or do this instead:
> >     ```
> >     if (!future.hasDiscard()) {
> >       return Break(future);
> >     }
> >     ```
> >     
> >     I guess the question is: if the inspect call actually fails, rather than hanging, do we want to retry? It looks like there are several cases in `Docker::inspect` which will result in a failure (failed to create subprocess, failed to read stdout, etc.), and it looks to me like we could probably just retry in those cases. WDYT?

Good point! If the `docker inspect` command exits with a non-zero status code, the docker library
itself retries `inspect`, and we get a log message like:
`I0228 17:28:13.275115  3248 docker.cpp:1369] Retrying inspect with non-zero status code.
cmd: 'docker -H unix:///var/run/docker.sock inspect mesos-210b988c-c808-47e5-af65-75f40269755b',
interval: 500ms`

But if the docker library itself returns a failure due to some severe bug (failed to create
subprocess, failed to read stdout, etc.), then IMO we should stop retrying `inspect`:

```
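[](const Future<Docker::Container>& future)
    -> Future<ControlFlow<Docker::Container>> {
  // `inspect` succeeded: stop the loop and return the container.
  if (future.isReady()) {
    return Break(future.get());
  }
  // The docker library itself failed: propagate the failure, don't retry.
  if (future.isFailed()) {
    return Failure(future.failure());
  }
  // Neither ready nor failed (i.e. discarded): retry `inspect`.
  return Continue();
});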
```
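
To make the three cases explicit without pulling in libprocess, here is a minimal standalone sketch of the same decision logic. The `State`, `Attempt`, `Outcome`, and `retryInspect` names are hypothetical and exist only for this illustration, and the `maxAttempts` bound is an assumption of the sketch (the loop in the patch has no such bound):

```cpp
#include <cassert>
#include <functional>
#include <string>

// Hypothetical stand-in for the terminal states of a libprocess future.
enum class State { Ready, Failed, Discarded };

// Hypothetical result of a single `inspect` attempt.
struct Attempt {
  State state;
  std::string value;  // Container name when Ready, error message when Failed.
};

// Outcome of the whole retry loop.
struct Outcome {
  bool ok;
  std::string value;  // Container name on success, error message otherwise.
  int attempts;       // Number of `inspect` attempts that were made.
};

// Retries `inspect` while attempts are discarded (e.g. abandoned because the
// daemon hung), stops on success, and gives up immediately on a hard failure.
// `maxAttempts` is an assumption of this sketch, used only to keep it finite.
Outcome retryInspect(const std::function<Attempt()>& inspect, int maxAttempts) {
  assert(maxAttempts > 0);
  for (int i = 1; i <= maxAttempts; ++i) {
    Attempt attempt = inspect();
    if (attempt.state == State::Ready) {
      return {true, attempt.value, i};   // Corresponds to Break(...).
    }
    if (attempt.state == State::Failed) {
      return {false, attempt.value, i};  // Corresponds to Failure(...).
    }
    // Discarded: corresponds to Continue(), so try again.
  }
  return {false, "gave up after too many discarded attempts", maxAttempts};
}
```

With an `inspect` callable that is discarded twice and then succeeds, `retryInspect` returns after the third attempt; with one that fails outright, it returns after the first attempt without retrying.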


- Andrei


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/65759/#review198382
-----------------------------------------------------------


On Feb. 22, 2018, 8:32 p.m., Andrei Budnik wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/65759/
> -----------------------------------------------------------
> 
> (Updated Feb. 22, 2018, 8:32 p.m.)
> 
> 
> Review request for mesos, Alexander Rukletsov, Gilbert Song, Greg Mann, and Michael Park.
> 
> 
> Bugs: MESOS-8574
>     https://issues.apache.org/jira/browse/MESOS-8574
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> This patch adds retries for the `inspect` command to work around docker
> daemon hangs. We assume that the docker daemon can be temporarily
> unresponsive. If it's unresponsive, then any started docker cli
> command hangs. To address the issue, we retry `inspect` in a loop.
> 
> 
> Diffs
> -----
> 
>   src/docker/executor.cpp 93c3e1d1e86814e34cbe5b045f6e61911266c535 
> 
> 
> Diff: https://reviews.apache.org/r/65759/diff/4/
> 
> 
> Testing
> -------
> 
> internal CI
> 
> Manually, described in /r/65713
> 
> 
> Thanks,
> 
> Andrei Budnik
> 
>

