mesos-reviews mailing list archives

From Benjamin Bannier <bbann...@apache.org>
Subject Re: Review Request 71641: Garbage-collected lost tasks which are reported as running again.
Date Mon, 28 Oct 2019 20:37:35 GMT


> On Oct. 28, 2019, 7:07 p.m., Benjamin Mahler wrote:
> > src/master/master.cpp
> > Lines 7848 (patched)
> > <https://reviews.apache.org/r/71641/diff/2/?file=2170613#file2170613line7848>
> >
> >     Hm.. don't we enforce agent removal by not allowing the agent to re-register?
> >     
> >     In the framework removal case, I guess we're not enforcing it?
> >     
> >     Having the task transition out of terminal seems a bit strange for those two cases (are there other cases?)

One scenario where this can happen is maintenance where an agent goes `down` and then `up`
again after agent failover. The master will transition the tasks without waiting for task
status updates from the agent. This patch adds a test for that (which fails without the patch).

I could also imagine scenarios involving framework teardown, agent failover, and framework
registration with the old `FrameworkID` after the master has already forgotten that ID.

This patch merely introduces a workaround for possible inconsistencies due to the design; we should
fix the design as well, see, e.g., MESOS-9940, which addresses one framework teardown edge case.
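
To make the intended behavior concrete, here is a minimal sketch of the garbage collection
idea, not the code in the diff. `Task`, `FrameworkTasks`, and `recoverTask` are hypothetical,
simplified stand-ins for illustration only and are not the actual master data structures:
when a task is reported as non-terminal again, any completed entry with the same ID is
dropped so the task is listed only once.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical, simplified stand-ins; NOT the actual Mesos types.
enum class TaskState { RUNNING, LOST };

struct Task {
  std::string id;
  TaskState state;
};

struct FrameworkTasks {
  std::vector<Task> active;     // Tasks in non-terminal states.
  std::vector<Task> completed;  // History of tasks in terminal states.
};

// Assumed to be called when an agent reports `task` in a non-terminal
// state. Removes every completed entry with the same ID, then records
// the task as active. Returns the number of completed entries removed.
std::size_t recoverTask(FrameworkTasks& tasks, const Task& task) {
  auto stale = std::remove_if(
      tasks.completed.begin(),
      tasks.completed.end(),
      [&task](const Task& t) { return t.id == task.id; });

  const std::size_t removed =
      static_cast<std::size_t>(std::distance(stale, tasks.completed.end()));

  tasks.completed.erase(stale, tasks.completed.end());
  tasks.active.push_back(task);

  return removed;
}
```

With a completed history holding lost tasks `t1` and `t2`, calling `recoverTask` for a
running `t1` would leave only `t2` in the history and list `t1` once, as active.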


- Benjamin


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/71641/#review218422
-----------------------------------------------------------


On Oct. 28, 2019, 6:53 p.m., Benjamin Bannier wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/71641/
> -----------------------------------------------------------
> 
> (Updated Oct. 28, 2019, 6:53 p.m.)
> 
> 
> Review request for mesos, Benno Evers, Benjamin Mahler, and Greg Mann.
> 
> 
> Bugs: MESOS-10018
>     https://issues.apache.org/jira/browse/MESOS-10018
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Under certain conditions tasks which were previously `TASK_LOST` and
> completed can reappear in non-terminal states, e.g., if the agent on
> which they were running reconnects.
> 
> This patch adds garbage collection of such completed tasks so that users
> do not see tasks twice when obtaining task information from the master
> API. This change does not affect task status updates where we already
> correctly reported a previously `TASK_LOST` state as superseded by e.g.,
> `TASK_RUNNING`.
> 
> 
> Diffs
> -----
> 
>   src/master/master.cpp 351823e69f14dbb5eb1ea2b108c42e93722f1eff 
>   src/tests/master_tests.cpp 5486e23ce146eda9191e081a48c1f3fcb52a7569 
> 
> 
> Diff: https://reviews.apache.org/r/71641/diff/2/
> 
> 
> Testing
> -------
> 
> `make check`
> 
> 
> Thanks,
> 
> Benjamin Bannier
> 
>

