mesos-reviews mailing list archives

From Adam B <a...@mesosphere.io>
Subject Re: Review Request 48744: Changed agent and scheduler authentication timeouts to ensure progress.
Date Fri, 17 Jun 2016 20:35:47 GMT

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/48744/#review138308
-----------------------------------------------------------




src/sched/sched.cpp (line 435)
<https://reviews.apache.org/r/48744/#comment203480>

    Why is 5.9s good enough? Why not 10s or more?
    What's the point of the `- Milliseconds(100)`? The two sides start their timeouts at different
times, so they're already offset by a bit.
    The scheduler/agent starts its timer first, as soon as it initiates the authentication, presumably
immediately after sending an AuthenticateMessage.
    It then takes some time for the network to deliver that message (<1s), and the message sits
in the master's event queue for a while (0 to ~30s) before Master::authenticate() is finally
called. The master might discard an old authentication request and defer the new request back
onto the queue (0-?s). Eventually, the master is ready to let the authenticator process the
request, and only then does the master start its 5s timer.
    
    So, if the master's event queue is full, it could take >30s for the master to start
its timer, by which point the scheduler/agent has already timed out and sent new requests
multiple times. That triggers the "Queuing up authentication request" behavior excessively,
filling up the master's event queue even more. That's why I think some kind of exponential
backoff would work better than a hardcoded scheduler/agent timeout.
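
    A rough sketch of the backoff idea, not the actual Mesos code: `backoffTimeout` and
`attemptsWithin` are hypothetical helpers, and `std::chrono` stands in for the Duration types used
in sched.cpp/slave.cpp. It assumes each attempt is retried immediately after it times out, and
counts how many requests reach the master while it is still busy.

```cpp
#include <algorithm>
#include <chrono>

using std::chrono::milliseconds;

// Hypothetical helper (not part of Mesos): timeout for the n-th
// authentication attempt (0-based). Starts at `base` and doubles on
// every retry until it reaches `cap`.
milliseconds backoffTimeout(
    unsigned attempt,
    milliseconds base,
    milliseconds cap)
{
  milliseconds timeout = base;
  for (unsigned i = 0; i < attempt && timeout < cap; ++i) {
    timeout *= 2;
  }
  return std::min(timeout, cap);
}

// Hypothetical helper: number of authentication attempts that are
// started within `window`, assuming every attempt times out and is
// retried immediately. Passing `cap == base` models a fixed timeout.
unsigned attemptsWithin(
    milliseconds window,
    milliseconds base,
    milliseconds cap)
{
  unsigned attempts = 0;
  milliseconds elapsed(0);
  while (elapsed < window) {
    elapsed += backoffTimeout(attempts, base, cap);
    ++attempts;
  }
  return attempts;
}
```

    Under these assumptions, with a 5s base and a master that needs 30s to get to the request,
a fixed timeout (`cap == base`) starts 6 attempts in that window, while doubling with a 60s cap
starts 3. The cap and the doubling factor are illustrative values, not a proposal for specific
constants.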


- Adam B


On June 15, 2016, 2:20 p.m., Benjamin Bannier wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/48744/
> -----------------------------------------------------------
> 
> (Updated June 15, 2016, 2:20 p.m.)
> 
> 
> Review request for mesos, Adam B and Vinod Kone.
> 
> 
> Bugs: MESOS-2043
>     https://issues.apache.org/jira/browse/MESOS-2043
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> The master, agent and scheduler all use the same value for when an
> authentication attempt times out. This can lead to situations where
> attempts time out on the master and e.g., an agent simultaneously.
> 
> If the agent then attempts another authentication while the master has
> not finished cleaning up the previous attempt, the master queues the
> new attempt behind the existing one and subsequently notifies the
> agent that the former attempt timed out. The agent, on the other hand,
> has already timed out that attempt and is waiting for the new one to
> make progress.
> 
> Once the master and, e.g., an agent have entered this cycle they will
> likely move in lockstep, and it becomes highly unlikely for the agent
> to successfully authenticate.
> 
> Here we change the timeout used in the agent and scheduler to avoid
> this lockstep behavior. We allow for slightly more time on the
> agent/scheduler side before an attempt times out. We also use a value
> that makes sure that cycles of authentication attempt and timeout have
> very different periods on master and agent/scheduler.
> 
> 
> Diffs
> -----
> 
>   src/sched/sched.cpp 9f561d73a2e591afdc3ba4adb35a11763dced402 
>   src/slave/slave.cpp 0af04d6fe53f92e03905fb7b3bec72b09d5e8e57 
> 
> Diff: https://reviews.apache.org/r/48744/diff/
> 
> 
> Testing
> -------
> 
> Tested on internal CI on a collection of Linux setups.
> 
> 
> Thanks,
> 
> Benjamin Bannier
> 
>

