mesos-reviews mailing list archives

From Jie Yu <yujie....@gmail.com>
Subject Re: Review Request 65465: Windows: Fixed recovery of Mesos containerizer.
Date Thu, 01 Feb 2018 23:35:06 GMT


> On Feb. 1, 2018, 10:32 p.m., Jie Yu wrote:
> > src/slave/containerizer/mesos/main.cpp
> > Lines 40-50 (patched)
> > <https://reviews.apache.org/r/65465/diff/1/?file=1951378#file1951378line40>
> >
> >     Flying by. Why is this logic not in launch.cpp? It sounds to me like it's unrelated to, for example, Mount below.
> 
> Andrew Schwartzmeyer wrote:
>     Where in `launch.cpp` would you put it? The handle needs to exist for exactly as long as the process exists (or as close as we can get, and putting it here gets it really close).

well, I don't think putting it here or in launch.cpp makes any noticeable difference in terms of
"closeness" (probably a dozen instructions?).

my question is: is this logic only related to the launch of a container or not? If yes, this
should be moved to launch.cpp (i.e., `MesosContainerizerLaunch::execute()`).
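
For reference, the lifetime constraint under discussion (per the quoted description below: the job object goes away once the last handle to it is closed, even while processes are still assigned to it) can be modeled with a toy sketch. This is not the Windows API and not Mesos code; `Kernel`, `JobObject`, `Handle`, `create`, `open`, and `recoverAfterAgentDeath` are all hypothetical names, and the reference-counting model is an assumption based only on the behavior the description reports.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>

// Toy stand-in for a kernel job object (hypothetical; not the Windows API).
struct JobObject {
  std::string name;
  int assignedProcesses = 0;  // Assigned processes do NOT keep the object alive.
};

// A "handle" is modeled as shared ownership of the object.
using Handle = std::shared_ptr<JobObject>;

// Toy kernel namespace: named objects are tracked weakly, so an object
// disappears as soon as the last handle to it is closed.
class Kernel {
public:
  Handle create(const std::string& name) {
    Handle handle = std::make_shared<JobObject>();
    handle->name = name;
    objects[name] = handle;
    return handle;
  }

  // Returns an empty handle if the object no longer exists.
  Handle open(const std::string& name) const {
    auto it = objects.find(name);
    return it == objects.end() ? Handle() : it->second.lock();
  }

private:
  std::map<std::string, std::weak_ptr<JobObject>> objects;
};

// Simulates the agent dying and recovery then looking up the job object.
// Returns true if recovery still finds it.
bool recoverAfterAgentDeath(bool containerizerHoldsHandle) {
  Kernel kernel;
  Handle agent = kernel.create("job");
  agent->assignedProcesses = 1;

  // The otherwise unused handle held by the containerizer process.
  Handle containerizer;
  if (containerizerHoldsHandle) {
    containerizer = kernel.open("job");
  }

  agent.reset();  // The agent dies, which closes its handle.
  return kernel.open("job") != nullptr;
}
```

In this model `recoverAfterAgentDeath(false)` fails to find the object and `recoverAfterAgentDeath(true)` finds it. The model only requires that the extra handle outlive the agent; it does not by itself settle whether the handle should be opened in main.cpp or in launch.cpp.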


- Jie


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/65465/#review196662
-----------------------------------------------------------


On Feb. 1, 2018, 7:57 p.m., Andrew Schwartzmeyer wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/65465/
> -----------------------------------------------------------
> 
> (Updated Feb. 1, 2018, 7:57 p.m.)
> 
> 
> Review request for mesos, Akash Gupta, Jie Yu, and Joseph Wu.
> 
> 
> Bugs: MESOS-8519
>     https://issues.apache.org/jira/browse/MESOS-8519
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> The Windows OS deletes the job object created in the agent process when
> the agent dies, because no other process holds a handle to it (despite
> processes being assigned to the job object). While this is
> counter-intuitive, it is the observed behavior. So in order for recovery
> to succeed, the containerizer must also hold an otherwise unused handle
> to its job object to keep it alive in the kernel, and available for
> recovery to find.
> 
> 
> Diffs
> -----
> 
>   src/slave/containerizer/mesos/main.cpp a53ccd68bf975d919f9d1f920cf3fa74d4e43f24 
> 
> 
> Diff: https://reviews.apache.org/r/65465/diff/1/
> 
> 
> Testing
> -------
> 
> ```
> [----------] Global test environment tear-down
> [==========] 874 tests from 85 test cases ran. (253311 ms total)
> [  PASSED  ] 874 tests.
> 
> I0201 12:46:58.159368  3116 slave.cpp:6921] Recovering framework eb32cef4-c503-4ab7-85d4-8d4577e6a3bf-0000
> I0201 12:46:58.159368  3116 slave.cpp:8543] Recovering executor 'notepad.01d79d48-0791-11e8-8f77-02421c3bc93c' of framework eb32cef4-c503-4ab7-85d4-8d4577e6a3bf-0000
> I0201 12:46:58.162847  9456 task_status_update_manager.cpp:207] Recovering task status update manager
> I0201 12:46:58.162847  9456 task_status_update_manager.cpp:215] Recovering executor 'notepad.01d79d48-0791-11e8-8f77-02421c3bc93c' of framework eb32cef4-c503-4ab7-85d4-8d4577e6a3bf-0000
> I0201 12:46:58.166851  7344 containerizer.cpp:674] Recovering containerizer
> I0201 12:46:58.167351  7344 containerizer.cpp:731] Recovering container 69cefa53-61e0-444b-a808-e38ffb4cb18f for executor 'notepad.01d79d48-0791-11e8-8f77-02421c3bc93c' of framework eb32cef4-c503-4ab7-85d4-8d4577e6a3bf-0000
> I0201 12:46:58.183379 17088 provisioner.cpp:493] Provisioner recovery complete
> I0201 12:46:58.186367 16792 slave.cpp:6695] Sending reconnect request to executor 'notepad.01d79d48-0791-11e8-8f77-02421c3bc93c' of framework eb32cef4-c503-4ab7-85d4-8d4577e6a3bf-0000 at executor(1)@10.123.7.41:52591
> I0201 12:46:58.194370  7344 slave.cpp:4519] Received re-registration message from executor 'notepad.01d79d48-0791-11e8-8f77-02421c3bc93c' of framework eb32cef4-c503-4ab7-85d4-8d4577e6a3bf-0000
> I0201 12:47:00.193958 16792 slave.cpp:4737] Cleaning up un-reregistered executors
> I0201 12:47:00.193958 16792 slave.cpp:6824] Finished recovery
> I0201 12:47:00.200943  9456 task_status_update_manager.cpp:181] Pausing sending task status updates
> I0201 12:47:00.200943  3116 slave.cpp:1146] New master detected at master@10.123.6.78:5050
> I0201 12:47:00.200943  3116 slave.cpp:1190] No credentials provided. Attempting to register without authentication
> I0201 12:47:00.200943  3116 slave.cpp:1201] Detecting new master
> I0201 12:47:00.214944 16792 slave.cpp:1471] Re-registered with master master@10.123.6.78:5050
> I0201 12:47:00.214944 13180 task_status_update_manager.cpp:188] Resuming sending task status updates
> I0201 12:47:00.215942 16792 slave.cpp:1516] Forwarding agent update {"operations":{},"resource_version_uuid" {"value":"jLIL1d\/PQnuwmFxpMf8CLQ=="},"slave_id":{"value":"7dc02270-a4e1-4f59-9ad7-56bad5182ea4S3"},"update_oversubscribed_resources":true}
> I0201 12:47:00.219952  3116 slave.cpp:3625] Updating info for framework eb32cef4-c503-4ab7-85d4-8d4577e6a3bf-0000 with pid updated to scheduler-aaa62980-8b1b-4775-b8bb-c6890b41941e@10.123.6.78:45907
> I0201 12:47:00.233942  7344 task_status_update_manager.cpp:188] Resuming sending task status updates
> ```
> 
> 
> Thanks,
> 
> Andrew Schwartzmeyer
> 
>

