mesos-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jiang Yan Xu <...@jxu.me>
Subject Re: Review Request 47082: LostSlaveMessage should be sent to affected frameworks only.
Date Mon, 06 Jun 2016 05:43:48 GMT

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/47082/#review135384
-----------------------------------------------------------



See the comment I had on the tests: I think we should revisit the semantics for this message
(when it should be sent).

A few more points:
1. Is there a test that covers "the basic flow": `slaveLost()` is being called because it
does have reserevd resources or tasks on the agent (in the same test we can have a idle framework
which doesn't get notified)?
2. Since this changes Mesos behavior observable by the frameworks, we should update documentation
for scheduler API `slaveLost()` to reflect this.
3. For the reason in 2), we should send an email to the dev and user list about this change
before we commit it.


src/master/master.hpp (lines 321 - 323)
<https://reviews.apache.org/r/47082/#comment200397>

    1. Would it make sense to use `multihashmap` to avoid the manual cleanups of empty `hashset`
etc.?
    2. Just so it's clearer to the readers let's comment on why we need to track them here.
    
    Perhaps
    
    ```
      // Tasks that have not yet been launched because they are currently
      // being authorized. Similar to 'pendingTasks' in Framwork but we 
      // track them separately here to determine which frameworks are
      // affected when an agent is removed.
    ```



src/master/master.cpp (line 3455)
<https://reviews.apache.org/r/47082/#comment201098>

    The **what** in this comment feels unnecessary as the code is self-explanatory. If we
add some comment here perhaps comment on **why**.
    
    Maybe: 
    
    ```
    // Add to the slave's list of pending tasks. This is
    // necessary to track frameworks who own any pending
    // task affected by a lost agent.
    ```



src/master/master.cpp (lines 3456 - 3459)
<https://reviews.apache.org/r/47082/#comment201094>

    If we are using `multihashmap` this could be:
    
    ```
    if (!slave->pendingTasks.contains(framework->id(), task.task_id()) {
      slave->pendingTasks.put(framework->id(), task.task_id());
    }
    ```



src/master/master.cpp (lines 3838 - 3841)
<https://reviews.apache.org/r/47082/#comment200412>

    We can keep the original comment which can cover both. No need to comment separately since
the statements here are pretty short and easy to understand.



src/master/master.cpp (line 4119)
<https://reviews.apache.org/r/47082/#comment200413>

    Let's use `slave != nullptr` for consistentcy with the rest of the file. (Note that MESOS-3243
was recently committed)



src/master/master.cpp (line 6577)
<https://reviews.apache.org/r/47082/#comment201142>

    s/if/if there are/
    
    So the bullet points are more consistently constructed.



src/master/master.cpp (line 6578)
<https://reviews.apache.org/r/47082/#comment201143>

    ```
    1. Reserved resources on this slave that belong to the framework's role; or
    ```
    
    Note that there are also statically reserved resources that are on this slave which aren't
checkpointed.



src/master/master.cpp (line 6579)
<https://reviews.apache.org/r/47082/#comment201144>

    ```
    2. Tasks on this slave launched by the framework; or
    ```
    
    We should avoid using `running` to avoid confusing because it may not be `RUNNING` yet
on the slave. We can just call them either tasks or pending tasks.



src/master/master.cpp (line 6580)
<https://reviews.apache.org/r/47082/#comment201147>

    nit: s/on/for/
    
    (as in destined for) because it's not added to/launched onto the slave yet.



src/master/master.cpp (line 6581)
<https://reviews.apache.org/r/47082/#comment201154>

    



src/master/master.cpp (line 6590)
<https://reviews.apache.org/r/47082/#comment201155>

    We should look through `slave->totalResources.reserved()` which includes statically
reserved resources.



src/master/master.cpp (lines 6591 - 6593)
<https://reviews.apache.org/r/47082/#comment201139>

    Per the comment above, this is not needed now.



src/master/master.cpp (line 6627)
<https://reviews.apache.org/r/47082/#comment201157>

    s/there are running/it has/



src/master/master.cpp (line 6636)
<https://reviews.apache.org/r/47082/#comment201159>

    No need to copy, you are not modifying `slave->pendingTasks` while iterating.
    
    Also align it like this
    
    ```
    foreachkey (const FrameworkID& frameworkId,
                slave->pendingTasks) {
    }
    ```
    
    if it exceeded 80 chars (which looks like it didn't so it should be put on the same line).



src/master/master.cpp (line 6639)
<https://reviews.apache.org/r/47082/#comment201161>

    It's naturally taken care of when `Slave` is destructed right?
    
    Plus always use an empty line between `}` and the next statement.



src/master/master.cpp (line 6651)
<https://reviews.apache.org/r/47082/#comment201166>

    s/lost slave/the lost slave/.



src/master/master.cpp (lines 6666 - 6667)
<https://reviews.apache.org/r/47082/#comment201165>

    ```
    // Add the frameworks that have pending inverse offers from this 
    // slave for notification for the lost slave.
    ```
    
    i.e.,
    
    - s/lost slave/the lost slave/
    - Also, while 70 char limit for comments is not a hard rule but consider using it if it
makes things look less jagged.



src/tests/master_tests.cpp (line 1760)
<https://reviews.apache.org/r/47082/#comment201248>

    So the following two tests actually caught something we didn't anticipate, so instead
of "fixing" the tests, we should fix our code:
    
    The scheduler driver remembers the slave pids for framework messages and old entries are
removed when it receives `LostSlaveMessage`s. The way we are doing it right now can cause
them to be nevered removed because they don't receive the `LostSlaveMessage`s!
    
    Note that there are two cases:
    
    1. If the master fails over, it doesn't know anything about the lost agent so it doesn't
know whehter it should send `LostSlaveMessage`s but it perahps should.
    2. If the master doesn't fail over and a framework's tasks have all completed before the
agent goes lost, it doesn't send `LostSlaveMessage` to it but then the save pid is not erased
on the driver.
    
    Let's chat about this further.



src/tests/master_tests.cpp (lines 1956 - 1957)
<https://reviews.apache.org/r/47082/#comment201174>

    


- Jiang Yan Xu


On May 26, 2016, 4:56 p.m., Anindya Sinha wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/47082/
> -----------------------------------------------------------
> 
> (Updated May 26, 2016, 4:56 p.m.)
> 
> 
> Review request for mesos and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-5143
>     https://issues.apache.org/jira/browse/MESOS-5143
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> When a slave is removed, master sends a LostSlaveMessage to affected
> frameworks only (instead of all registered frameworks). An affected
> framework is a framework which satisfied one or more conditions of
> the following:
> 
> 1. There are running tasks on this slave belonging to the framework.
> 2. There are pending tasks on this slave belonging to the framework.
> 3. Reserved resources on the slave have a matching role with the
>    role of the framework.
> 4. There are pending offers or pending inverse offers from this slave
>    for the framework.
> 
> 
> Diffs
> -----
> 
>   src/master/master.hpp 1a875c32eddfb6d884e3d0dda7f5716ee53966c3 
>   src/master/master.cpp 0005a29caabcc6a3776037cf86a2b12660e6377b 
>   src/tests/master_tests.cpp 34be015aa314a7574e9065efb7b1bb8e1570c5b7 
> 
> Diff: https://reviews.apache.org/r/47082/diff/
> 
> 
> Testing
> -------
> 
> All existing and modified tests passed.
> 
> 
> Thanks,
> 
> Anindya Sinha
> 
>


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message