mesos-reviews mailing list archives

From Greg Mann <g...@mesosphere.io>
Subject Re: Review Request 65482: Improved handling of non-terminal operations after master failover.
Date Thu, 15 Feb 2018 23:59:46 GMT


> On Feb. 15, 2018, 11:55 p.m., Greg Mann wrote:
> > src/master/master.cpp
> > Lines 7647-7654 (patched)
> > <https://reviews.apache.org/r/65482/diff/3/?file=1961521#file1961521line7647>
> >
> >     I'm sitting here trying to think of ways we might avoid crashing if the framework subscribes before the operation becomes terminal...
> >     
> >     Would it be reasonable to add an `if (framework == nullptr)` check to `updateOperation()` so that we only recover resources if the framework is known to the master?

Er... wait, that doesn't make sense :) I guess when we receive the operation update, we have
no way of knowing whether the framework had subscribed when the master learned about
the pending operation. As a workaround for now, we could store in a set the UUIDs of
operations for which we do not track allocated resources (i.e., operations which hit this
block of code). Then, in `updateOperation`, we could avoid recovering resources if the
operation's UUID is in the set?
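
To make the idea concrete, here is a rough, standalone sketch of the bookkeeping I have in mind. It is not taken from the patch: `OperationTracker`, `markUntracked`, and `finishOperation` are hypothetical names, UUIDs are plain strings rather than Mesos types, and in the master this would presumably live next to `updateOperation` instead of in its own class.

```cpp
#include <cstddef>
#include <string>
#include <unordered_set>

// Hypothetical stand-in for the master's bookkeeping. None of these names
// exist in Mesos; this only illustrates the proposed set of UUIDs.
class OperationTracker
{
public:
  // Call when the master learns about a non-terminal operation whose
  // framework is unknown, i.e. one whose resources were never counted
  // as allocated.
  void markUntracked(const std::string& operationUuid)
  {
    untracked.insert(operationUuid);
  }

  // Call when an operation becomes terminal. Returns true if the caller
  // should recover resources for it, false if the operation was never
  // tracked. The UUID is dropped from the set either way, so the set
  // does not grow without bound.
  bool finishOperation(const std::string& operationUuid)
  {
    return untracked.erase(operationUuid) == 0;
  }

  std::size_t untrackedCount() const { return untracked.size(); }

private:
  std::unordered_set<std::string> untracked;
};
```

With something like this, the terminal-update path would only recover resources when `finishOperation(uuid)` returns true. Note that the set would be lost on another master failover, so this could only ever be a stopgap.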


- Greg


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/65482/#review197641
-----------------------------------------------------------


On Feb. 14, 2018, 2:21 p.m., Benjamin Bannier wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/65482/
> -----------------------------------------------------------
> 
> (Updated Feb. 14, 2018, 2:21 p.m.)
> 
> 
> Review request for mesos, Greg Mann, Jie Yu, and Jan Schlicht.
> 
> 
> Bugs: MESOS-8536
>     https://issues.apache.org/jira/browse/MESOS-8536
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> This patch fixes the handling of non-terminal operations learned by a
> newly elected master after a master failover, so that only these
> operations are counted as using resources. Previously we did not count
> any operations as using resources, which by accident produced the
> expected behavior if the operations were already terminal when the
> master learned about them.
> 
> We do not address the issue of being unable to properly account for
> operations triggered by frameworks unknown to the master; see
> MESOS-8582. Instead we emit a warning for now, since the master might
> still abort on assertion failures caused by incomplete resource
> accounting.
> 
> 
> Diffs
> -----
> 
>   src/master/master.cpp b06d7a6e2fbbb81b97eaf537d5b6745c73dc867d 
> 
> 
> Diff: https://reviews.apache.org/r/65482/diff/3/
> 
> 
> Testing
> -------
> 
> `make check`, also tested with a version of the test added in r/65045 which triggered this issue.
> 
> 
> Thanks,
> 
> Benjamin Bannier
> 
>

