mesos-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Neil Conway <neil.con...@gmail.com>
Subject Re: Review Request 59320: Added test case for agent re-registration.
Date Tue, 13 Jun 2017 18:36:22 GMT


> On May 31, 2017, 12:09 a.m., Benjamin Mahler wrote:
> > src/master/master.cpp
> > Lines 6018-6025 (original), 6018-6025 (patched)
> > <https://reviews.apache.org/r/59320/diff/2/?file=1721802#file1721802line6018>
> >
> >     Hm.. what does slaveWasRemoved mean? Is it equivalen to the agent being marked
as unreachable? Or is being removed something beyond being marked as unreachable (what it
sounds like)? I had a hard time grasping the code below when it was referring to "unreachable"
but the variable is called "removed".

`slaveWasRemoved` means that _this_ instance of the master was the one that removed the agent
(because `removed` is only a cache, there might be false negatives). Similarly, `removed`
contains agents that _this_ instance of the master has marked unreachable. No objection to
renaming both variables, as long as we keep them in sync, although I can't think of a ton
of better names.


> On May 31, 2017, 12:09 a.m., Benjamin Mahler wrote:
> > src/master/master.cpp
> > Lines 6027-6043 (original), 6027-6043 (patched)
> > <https://reviews.apache.org/r/59320/diff/2/?file=1721802#file1721802line6027>
> >
> >     Hm.. it seems like we could maybe make this code easier to read. Currently,
the loop for dealing with frameworks is down below and the loop for dealing with tasks is
up here. However, it seems to me that these are highly related, and it would be easier to
reason about if it were consolidated into a loop over the framework and then a loop over the
tasks belonging to that framework. That would also eliminate the need for `partitionAwareFrameworks`.
> >     
> >     E.g.
> >     
> >     ```
> >       foreach (const FrameworkInfo& framework, frameworks):
> >         if completed:
> >           shut down the framework, drop the tasks
> >           continue
> >           
> >         if agent was unreachable/removed and framework is not partition aware:
> >           shut down the framework, drop the tasks
> >           continue
> >         
> >         for each task in the framework:
> >           recoveredTasks.push_back(task)
> >           remove from unreachable tasks
> >     ```

Hmmm. I can see how this would makes things clearer, but it might result in some subtle behavioral
differences. Right now, the master notifies the agent that it has re-registered, then sends
it `ShutdownFrameworkMessage`s in some cases. Adopting the rewrite you suggested would mean
shutting down frameworks while the agent thinks it hasn't re-registered yet. Not wrong but
confusing.

Since this code is likely to change (see the "RFC: Partition Awareness" thread on `dev`),
I'm inclined to leave this as-is in the short term.


- Neil


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/59320/#review176399
-----------------------------------------------------------


On May 31, 2017, 8:35 p.m., Neil Conway wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/59320/
> -----------------------------------------------------------
> 
> (Updated May 31, 2017, 8:35 p.m.)
> 
> 
> Review request for mesos and Benjamin Mahler.
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Add an explanatory comment and a test case for a particular case in the
> master's logic for handling agent re-registration.
> 
> 
> Diffs
> -----
> 
>   src/master/master.cpp a5376f9fc3c3c6c72b22cfd36ec56a1de0d03a3a 
>   src/tests/partition_tests.cpp 62a84f797201ccd18b71490949e3130d2b9c3668 
> 
> 
> Diff: https://reviews.apache.org/r/59320/diff/3/
> 
> 
> Testing
> -------
> 
> `make check`
> 
> Checked that if you add `CHECK_NOTNULL(framework)` to the master code (circa the newly
added comment), the incorrect `CHECK` is triggered by this test case (but not by any existing
test cases).
> 
> 
> Thanks,
> 
> Neil Conway
> 
>


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message