mesos-reviews mailing list archives

From "Michael Park" <mcyp...@gmail.com>
Subject Re: Review Request 35702: Added /reserve HTTP endpoint to the master.
Date Wed, 05 Aug 2015 09:51:47 GMT

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/35702/
-----------------------------------------------------------

(Updated Aug. 5, 2015, 9:51 a.m.)


Review request for mesos, Adam B, Benjamin Hindman, Ben Mahler, Jie Yu, Joris Van Remoortere,
and Vinod Kone.


Changes
-------

Addressed Jie's comments.


Bugs: MESOS-2600
    https://issues.apache.org/jira/browse/MESOS-2600


Repository: mesos


Description
-------

This involved a lot more challenges than I anticipated. I've captured the various approaches,
along with their limitations and deal-breakers, here: [Master Endpoint Implementation
Challenges](https://docs.google.com/document/d/1cwVz4aKiCYP9Y4MOwHYZkyaiuEv7fArCye-vPvB2lAI/edit#)

Key points:

* This is a stop-gap solution until we shift the offer creation/management logic from the
master to the allocator.
* `updateAvailable` and `updateSlave` are kept separate because:
  (1) `updateAvailable` is allowed to fail, whereas `updateSlave` must not.
  (2) `updateAvailable` returns a `Future`, whereas `updateSlave` does not.
  (3) `updateAvailable` must never leave the allocator in an over-allocated state, whereas
`updateSlave` can, and does.
* The algorithm:
    * Initially, the master pessimistically assumes that resources that appear "available"
will be gone.
      This is due to the race between the allocator scheduling an `allocate` call to itself
and the master's `allocator->updateAvailable` invocation.
      As such, we first try to satisfy the request using only the offered resources.
    * We greedily rescind one offer at a time until we've rescinded enough offers to cover
the request.
      IMPORTANT: We perform `recoverResources(..., Filters())` rather than `recoverResources(...,
None())` so that we almost always win the race against `allocate`.
      If we do lose the race, no disaster occurs; we simply fail to satisfy the request.
    * If we still don't have enough resources after rescinding all offers, we are optimistic
and forward the request to the allocator, since there may be available resources that can
satisfy the request.
    * If the allocator returns a failure, report the error to the user with `PreconditionFailed`.
This could later be changed to `Forbidden` or `Conflict`; we'll pick one eventually.
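
The steps above can be sketched roughly as follows. This is a minimal, single-threaded illustration and not the code in the diff: resources are reduced to one integer (`cpus`), and `Offer`, `ToyAllocator`, `Response`, and `reserveSketch` are hypothetical names made up for this example. The race against `allocate` and the `Filters()` vs. `None()` distinction are not modelled, and the fallible `updateAvailable` is represented by a `bool` return instead of a `Future`.

```cpp
#include <vector>

// Hypothetical stand-in for an outstanding offer on the agent.
struct Offer {
  int id;
  int cpus;
};

// Hypothetical toy allocator: tracks a single pool of unallocated cpus.
struct ToyAllocator {
  int available = 0;

  // Models recoverResources: a rescinded offer's resources return to the pool.
  void recoverResources(int cpus) { available += cpus; }

  // Models updateAvailable: may fail, and never over-allocates the pool.
  bool updateAvailable(int cpus) {
    if (cpus > available) {
      return false;
    }
    available -= cpus;
    return true;
  }
};

enum class Status { OK, PreconditionFailed };

struct Response {
  Status status;
  std::vector<int> rescinded;  // IDs of the offers that were rescinded.
};

Response reserveSketch(
    ToyAllocator& allocator,
    const std::vector<Offer>& offers,
    int requested) {
  Response response{Status::PreconditionFailed, {}};

  // Pessimistic phase: count only on what we recover from offers,
  // rescinding one offer at a time until the request is covered.
  int recovered = 0;
  for (const Offer& offer : offers) {
    if (recovered >= requested) {
      break;
    }
    allocator.recoverResources(offer.cpus);
    recovered += offer.cpus;
    response.rescinded.push_back(offer.id);
  }

  // Optimistic phase: forward the request even if the rescinded offers
  // were insufficient, since the allocator may hold available resources.
  if (allocator.updateAvailable(requested)) {
    response.status = Status::OK;
  }

  return response;
}
```

With offers of 2, 3, and 4 cpus and a request for 4, this sketch rescinds only the first two offers and leaves the third outstanding. It also shows why the approach is greedy and not minimal: rescinding the 4-cpu offer alone would have sufficed.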

This approach is clearly not ideal, since we would prefer to rescind as few offers as possible.
The challenges of implementing the ideal solution in the current state are described in the
document above.

TODO(mpark): Add more comments and test cases.


Diffs (updated)
-----

  src/master/http.cpp 76e70801925041f08bc94f0ca18c86f1a573b2b3 
  src/master/master.hpp e44174976aa64176916827bec4c911333c9a91db 
  src/master/master.cpp 5aa0a5410804fe16abd50b6953f1ffe46a019ecf 
  src/master/validation.hpp 43b8d84556e7f0a891dddf6185bbce7ca50b360a 
  src/master/validation.cpp ffb7bf07b8a40d6e14f922eabcf46045462498b5 

Diff: https://reviews.apache.org/r/35702/diff/


Testing
-------

`make check`


Thanks,

Michael Park

