mesos-reviews mailing list archives

From Adam B <a...@mesosphere.io>
Subject Re: Review Request 45474: MESOS-1739: Allow slave reconfiguration on restart, Phase 1.
Date Wed, 18 May 2016 19:18:30 GMT


> On March 30, 2016, 2:08 a.m., Adam B wrote:
> > src/tests/slave_tests.cpp, lines 3541-3545
> > <https://reviews.apache.org/r/45474/diff/3/?file=1318849#file1318849line3541>
> >
> >     Here you shut down the slave and wait (you'll probably want to advance the clock rather than wait for 90s) for the slave to be declared SLAVE_LOST. Once this occurs, the master will no longer allow the slave to reregister with the same slaveId, and the slave will be told to kill all running tasks. The slave will do so and then restart and register with a new slaveId.
> >     This is what is meant by the quote from the design doc: "Currently this can only be handled by stopping / draining a mesos slave entirely (Killing all of its running jobs), removing it from the cluster, then bringing it back up as a brand new slave."
> >     
> >     To truly observe this behavior, you should start a task on the slave before you shut it down. Then you will see a TASK_LOST and the task will be killed.
> 
> Deshi Xiao wrote:
>     Thanks Adam, I will update the test case.
> 
> Deshi Xiao wrote:
>     @Adam B
>     I'm confused here and need your guidance. If the test case tracks TASK_LOST across a slave restart, do we expect the slave_id to stay valid (not become outdated)?

Desired behavior: Operator can kill a slave process and restart it with new --attributes. Existing tasks continue to run. No TASK_LOST or SLAVE_LOST message is sent. The slaveId remains the same. Outstanding offers from that slave are rescinded, and those offers are remade with the updated attributes.

Current behavior 1: Operator shuts down a slave process and restarts it with --recover=cleanup, which kills all its tasks, clears the work_dir, and notifies the master that the old slaveId is "shutdown" and will never be reused (SLAVE_LOST, offers rescinded, TASK_KILLED/LOST). Operator then restarts the slave with new --attributes; it gets a new slaveId, and new offers are made with the new slaveId and updated attributes.

Current behavior 2: Slave process dies or is killed and tries to restart with new --attributes. It errors on recovery.

Current behavior 3: Slave process dies or is killed and doesn't reregister within `slave_ping_timeout*max_slave_ping_timeouts` (90s). Master considers it gone and sends SLAVE_LOST and TASK_LOST. Future attempts to reregister with the same slaveId fail. Slave must be cleaned up (tasks killed, work_dir removed) so it can register with a new slaveId (and new attributes).
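
To make the cases above easier to compare, here is a rough sketch of them as a small decision function. This is not Mesos code: `Outcome`, `Restart`, `lostAfter`, `currentOutcome` and `desiredOutcome` are hypothetical names, and the order of the checks is my assumption. It also assumes that a restart with unchanged attributes inside the window keeps its slaveId, and that the desired behavior changes only the "new --attributes" case.

```cpp
#include <cassert>
#include <chrono>

// All names here are hypothetical and for illustration only; none are Mesos APIs.
enum class Outcome {
  KeepsSlaveId,            // tasks keep running, same slaveId
  NewSlaveIdAfterCleanup,  // current behavior 1
  RecoveryError,           // current behavior 2
  ReregistrationRejected   // current behavior 3
};

struct Restart {
  bool attributesChanged;          // restarted with different --attributes
  bool cleanupRecovery;            // operator ran --recover=cleanup first
  std::chrono::seconds downtime;   // time until the slave tries to reregister
};

// Window after which the master considers the slave gone:
// slave_ping_timeout * max_slave_ping_timeouts.
inline std::chrono::seconds lostAfter(std::chrono::seconds pingTimeout,
                                      int maxPingTimeouts) {
  return pingTimeout * maxPingTimeouts;
}

// Behavior as described for the code before this change.
// Assumption: the checks are evaluated in this order.
inline Outcome currentOutcome(const Restart& r, std::chrono::seconds window) {
  if (r.cleanupRecovery) {
    return Outcome::NewSlaveIdAfterCleanup;
  }
  if (r.attributesChanged) {
    return Outcome::RecoveryError;
  }
  if (r.downtime >= window) {
    return Outcome::ReregistrationRejected;
  }
  return Outcome::KeepsSlaveId;
}

// Desired behavior: changed attributes no longer cause a recovery error.
// Assumption: the cleanup and lost-slave cases stay as they are today.
inline Outcome desiredOutcome(const Restart& r, std::chrono::seconds window) {
  if (r.cleanupRecovery) {
    return Outcome::NewSlaveIdAfterCleanup;
  }
  if (r.downtime >= window) {
    return Outcome::ReregistrationRejected;
  }
  return Outcome::KeepsSlaveId;
}
```

For example, with the 90s window mentioned above, `currentOutcome(Restart{true, false, std::chrono::seconds(5)}, std::chrono::seconds(90))` gives `Outcome::RecoveryError`, while `desiredOutcome` with the same arguments gives `Outcome::KeepsSlaveId`. A test for the lost-slave case would want to advance the clock past the window rather than sleep through it.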


- Adam


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/45474/#review126066
-----------------------------------------------------------


On March 30, 2016, 1:13 a.m., Deshi Xiao wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/45474/
> -----------------------------------------------------------
> 
> (Updated March 30, 2016, 1:13 a.m.)
> 
> 
> Review request for mesos, Adam B, Greg Mann, haosdent huang, and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-1739
>     https://issues.apache.org/jira/browse/MESOS-1739
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Phase 1
> Make SlaveInfo mutable throughout the stack, and allow for expansion of resources and attributes only (which allows testing to make sure it propagates to the allocator, shows up in offers, etc.). Ensure there is unified checking for incompatibilities in both the slave and master (the slave should validate the config, the master should validate that all operations the slave takes are legal).
> 
> It is derived from another review request (https://reviews.apache.org/r/25525/).
> 
> 
> Diffs
> -----
> 
>   src/tests/slave_tests.cpp 1f1a31020096efa5db698e86ac74e61dfdb4b94a 
> 
> Diff: https://reviews.apache.org/r/45474/diff/
> 
> 
> Testing
> -------
> 
> make check on localhost
> 
> 
> Thanks,
> 
> Deshi Xiao
> 
>

