mesos-reviews mailing list archives

From Greg Mann <g...@mesosphere.io>
Subject Re: Review Request 71641: Garbage-collected lost tasks which are reported as running again.
Date Tue, 29 Oct 2019 23:35:48 GMT


> On Oct. 28, 2019, 6:07 p.m., Benjamin Mahler wrote:
> > src/master/master.cpp
> > Lines 7848 (patched)
> > <https://reviews.apache.org/r/71641/diff/2/?file=2170613#file2170613line7848>
> >
> >     Hm.. don't we enforce agent removal by not allowing the agent to re-register?
> >     
> >     In the framework removal case, I guess we're not enforcing it?
> >     
> >     Having the task transition out of a terminal state seems a bit strange for those two cases (are there other cases?)
> 
> Benjamin Bannier wrote:
>     One scenario where this can happen is maintenance where an agent goes `down` and then `up` again after agent failover. The master will transition the tasks without waiting for task status updates from the agent. This patch adds a test for that (which fails without the patch).
>     
>     I could imagine scenarios involving framework teardown, agent failover, and framework registration using the old `FrameworkID` as well when the master has already forgotten the ID.
>     
>     This patch merely works around possible inconsistencies due to the design; we should fix the design as well, see e.g., MESOS-9940 which addresses one framework teardown edge case.
> 
> Benjamin Mahler wrote:
>     Ok, perhaps the patch and comment can be re-framed? "Garbage-collect" sounds like cleaning up old unneeded data, but this is a mitigation papering over possible inconsistency that can arise due to flawed design (i.e. lack of enforcement of actions that the master is taking, or in the case of MESOS-9940 probably the master should defer to the agent for the outcome).
>     
>     Tasks are not supposed to be coming out of KILLED (is this possible for other states too?). Perhaps the comment should clarify all exact known cases where this is possible?
>     
>     Perhaps we should also be logging any actual removals as warnings in the log to highlight that it happened?

> Tasks are not supposed to be coming out of KILLED (is this possible for other states too?). Perhaps the comment should clarify all exact known cases where this is possible?

Should we be asserting that the task is in an expected state?
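
To make the question concrete, here is a minimal, self-contained sketch of what such a check could look like. It is not taken from the patch: `TaskState`, `CompletedTasks`, `mayReappear`, and `removeIfReappeared` are hypothetical names, and the assumption that only `LOST` is an expected state to come back from is purely illustrative. The sketch logs a warning rather than aborting, but the same condition could back a hard assertion instead.

```cpp
#include <iostream>
#include <string>
#include <unordered_map>

// Hypothetical stand-ins; none of these names come from the actual patch.
enum class TaskState { RUNNING, LOST, KILLED, FINISHED };

// Assumption for illustration: only `LOST` is a terminal state that a task
// is expected to be able to leave again (e.g. when its agent reconnects).
inline bool mayReappear(TaskState state) {
  return state == TaskState::LOST;
}

struct CompletedTasks {
  // Completed tasks keyed by task ID, with their last known terminal state.
  std::unordered_map<std::string, TaskState> tasks;

  // Called when a task is reported as running again. Removes the stale
  // completed entry if it was in an expected state and returns true;
  // otherwise leaves the entry alone and returns false.
  bool removeIfReappeared(const std::string& taskId) {
    auto it = tasks.find(taskId);
    if (it == tasks.end()) {
      return false; // Nothing recorded as completed for this task.
    }

    if (!mayReappear(it->second)) {
      // Unexpected state: surface it loudly instead of silently cleaning up.
      std::cerr << "WARNING: task '" << taskId
                << "' reappeared from an unexpected terminal state;"
                << " keeping the completed entry" << std::endl;
      return false;
    }

    // Expected state: still log the removal so that it is visible.
    std::cerr << "WARNING: removing completed task '" << taskId
              << "' which is reported as running again" << std::endl;
    tasks.erase(it);
    return true;
  }
};
```

Whether the unexpected branch should warn, as here, or fail hard depends on how confident we are that the list of expected states is complete.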


- Greg


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/71641/#review218422
-----------------------------------------------------------


On Oct. 28, 2019, 5:53 p.m., Benjamin Bannier wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/71641/
> -----------------------------------------------------------
> 
> (Updated Oct. 28, 2019, 5:53 p.m.)
> 
> 
> Review request for mesos, Benno Evers, Benjamin Mahler, and Greg Mann.
> 
> 
> Bugs: MESOS-10018
>     https://issues.apache.org/jira/browse/MESOS-10018
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Under certain conditions tasks which were previously `TASK_LOST` and
> completed can reappear in non-terminal states, e.g., if the agent on
> which they were running reconnects.
> 
> This patch adds garbage collection of such completed tasks so that users
> do not see tasks twice when obtaining task information from the master
> API. This change does not affect task status updates where we already
> correctly reported a previously `TASK_LOST` state as superseded by e.g.,
> `TASK_RUNNING`.
> 
> 
> Diffs
> -----
> 
>   src/master/master.cpp 351823e69f14dbb5eb1ea2b108c42e93722f1eff 
>   src/tests/master_tests.cpp 5486e23ce146eda9191e081a48c1f3fcb52a7569 
> 
> 
> Diff: https://reviews.apache.org/r/71641/diff/3/
> 
> 
> Testing
> -------
> 
> `make check`
> 
> 
> Thanks,
> 
> Benjamin Bannier
> 
>

