mesos-reviews mailing list archives

From Benno Evers <bev...@mesosphere.com>
Subject Re: Review Request 67403: Handled race condition when removing maintenance windows.
Date Mon, 04 Jun 2018 14:46:48 GMT


> On May 31, 2018, 4:41 p.m., Vinod Kone wrote:
> > Can you add a unit test for this?
> 
> Benno Evers wrote:
>     It's tricky because we need very precise control over the scheduling, and I'm not sure our testing infrastructure provides it. But I'll look into it.
> 
> Vinod Kone wrote:
>     I see. The bug is in the allocator, so you cannot use a mock allocator unfortunately for control. Consider pausing the clock to have better control in the test.
> 
> Benno Evers wrote:
>     After discussing with Benjamin Bannier, we came to the conclusion that it's currently not possible to write a unit test for this scenario, because we're lacking the capability to intercept a dispatch and re-insert it into the event queue at a later time.
> 
> Joseph Wu wrote:
>     I gave writing the test a shot... and I think it might be possible, but the resulting test would be too fragile to be a regression test.
>     
>     Here's my (not working yet) attempt: https://github.com/kaysoky/mesos/commit/29c6a1807d65d01440b7c67a73062ae9af892afe

Do you plan to continue working on that, or should we go ahead and commit the fix?


- Benno


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/67403/#review204121
-----------------------------------------------------------


On June 1, 2018, 2:17 p.m., Benno Evers wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/67403/
> -----------------------------------------------------------
> 
> (Updated June 1, 2018, 2:17 p.m.)
> 
> 
> Review request for mesos, Joseph Wu and Vinod Kone.
> 
> 
> Bugs: MESOS-7966
>     https://issues.apache.org/jira/browse/MESOS-7966
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> When executing the `Master::inverseOffers()` callback, it
> could happen that the maintenance window the inverse offer
> referred to was already removed by a concurrent call to
> the maintenance endpoint of Mesos.
> 
> In this case, we must not send out an inverse offer, because
> having outstanding inverse offers for an agent without
> any scheduled maintenance window will lead to a crash in
> the allocator when attempting to remove this offer.
> 
> 
> Diffs
> -----
> 
>   src/master/master.cpp ba3f8746ea393c8655fcd5ceaace099f68df0b19 
> 
> 
> Diff: https://reviews.apache.org/r/67403/diff/2/
> 
> 
> Testing
> -------
> 
> `make check`
> 
> Set up the reproduction environment locally and ran `while :; do python call.py; done` for about a minute (see linked ticket).
> 
> 
> Thanks,
> 
> Benno Evers
> 
>
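
The guard described in the Description above can be illustrated with a small standalone model. This is only a sketch under stated assumptions, not the actual change to src/master/master.cpp: `MasterModel`, `Unavailability`, `InverseOffer` and their members are hypothetical stand-ins for the real Mesos types, and the model is single-threaded, so the "concurrent" removal is simulated by calling `removeMaintenance()` between proposing and sending the offers.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical, simplified stand-ins; these are NOT the real Mesos types.
struct Unavailability {
  long startNanos;
  long durationNanos;
};

struct InverseOffer {
  std::string agentId;
  Unavailability unavailability;
};

class MasterModel {
public:
  // Models a call to the maintenance endpoint that adds a window.
  void scheduleMaintenance(const std::string& agentId, Unavailability u) {
    windows_[agentId] = u;
  }

  // Models a call to the maintenance endpoint that removes a window.
  void removeMaintenance(const std::string& agentId) {
    windows_.erase(agentId);
  }

  // Models the inverse-offer callback: the allocator proposed these
  // offers earlier, so the schedule may have changed in the meantime.
  // Offers whose agent no longer has a maintenance window are dropped
  // instead of being sent out.
  std::vector<InverseOffer> inverseOffers(
      const std::vector<InverseOffer>& proposed) const {
    std::vector<InverseOffer> sent;
    for (const InverseOffer& offer : proposed) {
      if (windows_.find(offer.agentId) == windows_.end()) {
        continue; // Window was removed after the offer was proposed.
      }
      sent.push_back(offer);
    }
    return sent;
  }

private:
  std::map<std::string, Unavailability> windows_;
};
```

In this model, an offer proposed while a window exists is sent as usual, and the same offer is silently skipped once `removeMaintenance()` has run for that agent, so no outstanding inverse offer is left behind for an agent without a scheduled window.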

