mesos-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Benjamin Bannier <benjamin.bann...@mesosphere.io>
Subject Re: Review Request 65482: Improved handling of non-terminal operations after master failover.
Date Fri, 16 Feb 2018 14:13:08 GMT


> On Feb. 16, 2018, 12:55 a.m., Greg Mann wrote:
> > src/master/master.cpp
> > Lines 7645 (patched)
> > <https://reviews.apache.org/r/65482/diff/3/?file=1961521#file1961521line7645>
> >
> >     s/`RunTaskMessage`, see/`RunTaskMessage`. See/

Hmm ... I am not a native speaker, but wouldn't starting an extra sentence here needlessly
inflate this comment (e.g., we would likely want some verb as well then)? We seem to use this
pattern elsewhere already,

    % git grep ', see' src | wc -l
    77


> On Feb. 16, 2018, 12:55 a.m., Greg Mann wrote:
> > src/master/master.cpp
> > Lines 7647-7654 (patched)
> > <https://reviews.apache.org/r/65482/diff/3/?file=1961521#file1961521line7647>
> >
> >     I'm sitting here trying to think of ways we might avoid crashing if the framework
subscribes before the operation becomes terminal...
> >     
> >     Would it be reasonable to add an `if (framework == nullptr)` check to `updateOperation()`
so that we only recover resources if the framework is known to the master?
> 
> Greg Mann wrote:
>     Er... wait that doesn't make sense :) I guess when we receive the operation update,
we have no way of knowing whether or not the framework had subscribed when the master learned
about the pending operation. As a workaround for now, we could store in a set the operation
UUIDs of operations for which we do not track allocated resources (i.e., operations which
hit this block of code). Then, in `updateOperation` we could avoid recovering resources if
the operation's UUID is in the set?

No matter what we do here, we will already have entered dangerous territory with `CHECK` failures
looming as soon as we added such an operation to master state, since we cannot update the
allocator on these used resources (update the allocation to reflect operation results, or
recover). We might also end up unknowingly oversubscribing the resources used by the operation.

I am unsure whether working around this by e.g., no updating the allocator when the operation
becomes terminal is a good way forward since it seems to delay the needed master failover
for even longer. With the approach I proposed we would failover when this operation gets terminal
(i.e., when we can safely reconcile this particular operation).

I added the framework ID to the log output since it might be useful for operators debugging
such scenarios.


- Benjamin


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/65482/#review197641
-----------------------------------------------------------


On Feb. 16, 2018, 3:12 p.m., Benjamin Bannier wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/65482/
> -----------------------------------------------------------
> 
> (Updated Feb. 16, 2018, 3:12 p.m.)
> 
> 
> Review request for mesos, Greg Mann, Jie Yu, and Jan Schlicht.
> 
> 
> Bugs: MESOS-8536
>     https://issues.apache.org/jira/browse/MESOS-8536
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> This patch fixes the handling of non-terminal operations learned by a
> newly elected master after a master failover, so that only these
> operations are counted as using resources. Previously we did not count
> any operations as using resources which by accident produced expected
> behavior if the operation was already terminal when the master learned
> about them.
> 
> We do not address the issue of being unable to properly account for
> operations triggered by frameworks unknown to the master, see
> MESOS-8582. Instead we emit a warning for now since the master might
> continue to abort due to assertion failures due to incomplete resource
> accounting.
> 
> 
> Diffs
> -----
> 
>   src/master/master.cpp b06d7a6e2fbbb81b97eaf537d5b6745c73dc867d 
> 
> 
> Diff: https://reviews.apache.org/r/65482/diff/4/
> 
> 
> Testing
> -------
> 
> `make check`, also tested with a version of the test added in r/65045 which triggered
this issue.
> 
> 
> Thanks,
> 
> Benjamin Bannier
> 
>


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message