mesos-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Qian Zhang <zhq527...@gmail.com>
Subject Re: Review Request 68495: Made command check always waits before removing the nested container.
Date Mon, 03 Sep 2018 14:20:44 GMT


> On Aug. 28, 2018, 1:16 a.m., Alexander Rukletsov wrote:
> > src/checks/checker_process.cpp
> > Lines 878-889 (original), 890-901 (patched)
> > <https://reviews.apache.org/r/68495/diff/1/?file=2077041#file2077041line890>
> >
> >     It looks like we should always call `waitNestedContainer()` after we said `previousCheckContainerId
= checkContainerId;`. For example here.
> >     
> >     Maybe it makes sense to call `waitNestedContainer()` right in the beginning?
We can end up calling it twice, but I think it's fine?
> 
> Qian Zhang wrote:
>     > Maybe it makes sense to call waitNestedContainer() right in the beginning? We
can end up calling it twice, but I think it's fine?
>     
>     That means we will call agent API `WAIT_NESTED_CONTAINER` twice for each successful
launch of check container, I think that might be a burden for agent in a large scale env.
So I'd still prefer to call it only in the places where we have to do it.
> 
> Qian Zhang wrote:
>     And for the case (L879:L888, i.e., the connection to agent failed) that you pointed
out above, I think when the connection to agent is back (e.g., agent starts up again), the
check container will be treated as orphan container and destroyed by agent, and then we will
remove it here: https://github.com/apache/mesos/blob/1.6.1/src/checks/checker_process.cpp#L616:L638.
However I am going to post another patch to change these codes (https://github.com/apache/mesos/blob/1.6.1/src/checks/checker_process.cpp#L660:L664)
to something like:
>     ```
>               promise->discard();
>             } else {
>               previousCheckContainerId = None();
>               _nestedCommandCheck(promise, cmd, nested);
>             }
>     ```
>     In this way, if we fail to remove the check container (e.g., due to agent has not
finished recovery, or the check container is still in `DESTROYING` state), we will try to
remove it again.
> 
> Qian Zhang wrote:
>     I posted another patch https://reviews.apache.org/r/68555/ as I mentioned above.
> 
> Alexander Rukletsov wrote:
>     So the assumption you're making is that if `Connection::send()` fails, the container
will not start and hence we don't need to call `wait()` before we destroy? It is not obvious
to a reader and requires some non-local knowledge about how the UCR works. Can you please
leave a comment documenting this assumption so that when it changes and someone will be trying
to figure out why checks sometimes started fail, they have a hint : )

> So the assumption you're making is that if Connection::send() fails, the container will
not start and hence we don't need to call wait() before we destroy?

Actually the check container could be started even if `Connection::send()` fails, e.g., agent
is killed right after the check container is launched but before the agent sends the launch
result to the default executor. My point is, in this case the checker container will eventually
destroyed by the agent after agent is started up again and we will eventually remove the check
container because of https://reviews.apache.org/r/68555/ (i.e., keep retrying the remove until
the check container is removed).


- Qian


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/68495/#review207984
-----------------------------------------------------------


On Aug. 24, 2018, 5:54 p.m., Qian Zhang wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/68495/
> -----------------------------------------------------------
> 
> (Updated Aug. 24, 2018, 5:54 p.m.)
> 
> 
> Review request for mesos, Andrei Budnik, Alexander Rukletsov, Gastón Kleiman, and Gilbert
Song.
> 
> 
> Bugs: MESOS-8568
>     https://issues.apache.org/jira/browse/MESOS-8568
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Made command check always waits before removing the nested container.
> 
> 
> Diffs
> -----
> 
>   src/checks/checker_process.cpp 77a76f465fe57eab89f027b5acb74c2339551678 
> 
> 
> Diff: https://reviews.apache.org/r/68495/diff/1/
> 
> 
> Testing
> -------
> 
> sudo make check
> 
> 
> Thanks,
> 
> Qian Zhang
> 
>


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message