mesos-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Vinod Kone <vinodk...@gmail.com>
Subject Re: Review Request 46187: Terminate when receiving the ACK of terminal status update.
Date Thu, 25 Aug 2016 18:04:47 GMT


> On April 14, 2016, 4:49 p.m., Vinod Kone wrote:
> > src/launcher/http_command_executor.cpp, line 749
> > <https://reviews.apache.org/r/46187/diff/1/?file=1343828#file1343828line749>
> >
> >     Looking at slave::statusUpdate() code there are several scenarios where the
slave ignores a status update sent by the executor; this means this executor could end up
not terminating forever.
> >     
> >     Can you do the following:
> >     
> >     --> Enque a message in the queue to self terminate after some timeout (you
can use the delay() function) to be safe.
> >     
> >     --> Add a TODO that we do this to be safe and also because slave sometimes
doesn't ACK a status update. Link to a ticket that fixes the slave status update semantics
to always ACK a status update sent by an executor.
> >     
> >     sounds good?
> 
> Vinod Kone wrote:
>     @Qian, any update on this? If this particular review is going to take some time,
I think it is still useful two commit the other 2 reviews in this chain. AFAICT, they are
independent of this review?
> 
> Qian Zhang wrote:
>     @Vinod, sorry for the late. I have filed a ticket (https://issues.apache.org/jira/browse/MESOS-5262)
for enhancing `slave::statusUpdate()` to always ACK the status update sent by executor.
>     
>     And can you please elaborate about the specific scenarios this executor could not
terminate forever. Originially I thought the scenario should be: executor sends a terminal
status upate to slave when the corresponding framework is in `TERMINATING` state (e.g., operator
tears down the framework), then in `Slave::statusUpdate()`, this status update will be ignored,
so the executor will not get the ACK. But after testing, I found in this case the executor
can still terminate, because the container corresponded to this executor will be destroyed
by `Slave::shutdownExecutorTimeout()` -> `MesosContainerizer::destroy()`, so after `--executor_shutdown_grace_period`,
the executor can still terminate. So I am not in which case the executor will never terminate.
>     
>     And yes, the other 2 patches are independent of this one, I will make them not depending
on this one in the review board, thanks!
> 
> Qian Zhang wrote:
>     After more thinking, I see one scenario the executor could never terminate is: agent
is down right after it sends SHUTDOWN event to executor. In this case, when handling the SHUTDOWN
event, executor will kill the task and send TASK_KILLED status update to agent, but it will
not get ACK since agent is already down, so the executor will still run. But I think once
agent is started again, executor will receive the ACK and then terminate. I am not sure if
this behavior is OK, or we want executor to terminate once it receives the SHUTDOWN event
rather than wait for agent is started again?
> 
> Vinod Kone wrote:
>     I think it is the right thing for the executor to stay up until the agent comes back
up and sends an ACK. Otherwise, which is currently the case, the agent has no idea about the
exit status of the executor.
>     
>     So the cases where I see no ACKs from the agent are:
>     
>     1) `update` doesn't have UUID.
>       --> Is this possible with the HTTP executor? If not, we should probably shutdown
the executor instead of just ignoring.
>     
>     2) Framework is unknown.
>       --> Don't think we can do much here because we don't have access to Executor
object?
>     
>     3) Framework is terminating.
>       --> Per your observation the agent is guaranteed to destroy the container. We
need to add a comment here explainining this.
>     
>     4) Executor is unknown. While we forward the status update to the framework, the
ACK is dropped by `_statusUpdate` and `___statusUpdate()`
>        --> If the update is generated by the agent (`killTask()`, `_runTask()`) we
should be fine because there is no executor.
>        --> If the update is sent by an executor for a task belonging to another executor,
that another executor is hopefully already dead. So we should be fine.
>        --> Is it possible for the update to be sent by an executor that is not known
to the agent? Don't think we can do much here since we don't have access to the Executor object.
> 
> Qian Zhang wrote:
>     > I think it is the right thing for the executor to stay up until the agent comes
back up and sends an ACK. Otherwise, which is currently the case, the agent has no idea about
the exit status of the executor.
>     
>     But that seems conflict with your original comment. My understanding to your original
comment is, to avoid executor not terminating forever, executor needs to enque a message to
self terminate after some timeout (I think we can do it via delay() in `reaped()` after terminal
status update is sent), if so, then when executor handles SHUTDOWN event, it will self terminate
anyway during agent down, so it is not possible to make executor stay up until agent comes
back.
>     
>     With this patch, I think the behavior of HTTP commmand executor and driver based
command executor will be different, let's see this scenario: a checkpoint-enabled frameworks
launches a "sleep 100" task, after the task is running, agent is down, and during agent down,
the task finishes, and then agent is started again. In this case, after agent is started again,
for driver-based command executor, framework will receive a TASK_FAILED since the executor
has terminated during agent is down, but for HTTP command executor (with this patch applied),
framework will receive a TASK_FINISHED since the executer will NOT terminate until it receives
the ACK for the TASK_FINISHED.
>     
>     I think we need to make them have consistent behavior. Any suggestions? :-)
>     
>     > 1) update doesn't have UUID.
>       --> Is this possible with the HTTP executor? If not, we should probably shutdown
the executor instead of just ignoring.
>     
>     This is not possible with HTTP command executor, because in `HttpCommandExecutor::update()`
we always set UUID in the status upate before sending it to agent.
>     
>     > 2) Framework is unknown.
>       --> Don't think we can do much here because we don't have access to Executor
object?
>     
>     I do not think this will happen in our case, because framework will only be removed
(`Slave::removeFramework()`) when it has no executor (`framework->executors.empty()`),
but in our case, it must have one executor (the HTTP command executor).
>     
>     
>     > 4) Executor is unknown. While we forward the status update to the framework,
the ACK is dropped by _statusUpdate and ___statusUpdate()
>     
>     >   --> If the update is generated by the agent (killTask(), _runTask()) we
should be fine because there is no executor.
>     
>     >   --> If the update is sent by an executor for a task belonging to another
executor, that another executor is hopefully already dead. So we should be fine.
>     
>     >   --> Is it possible for the update to be sent by an executor that is not
known to the agent? Don't think we can do much here since we don't have access to the Executor
object.
>     
>     I do not think it is possible that an executor sends a update to agent but agent
does not know this executor.

There might be no cases today where executor doesn't get an ACK, but I would still prefer
for us to be safe here incase there is a bug in the slave now or in the future. So lets add
a timeout to self terminate with a comment saying that we do that to be safe. The executor
can exit with non-zero status so that we can catch such situations.


- Vinod


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/46187/#review128916
-----------------------------------------------------------


On April 27, 2016, 1:23 a.m., Qian Zhang wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/46187/
> -----------------------------------------------------------
> 
> (Updated April 27, 2016, 1:23 a.m.)
> 
> 
> Review request for mesos, Anand Mazumdar and Vinod Kone.
> 
> 
> Bugs: MESOS-4855
>     https://issues.apache.org/jira/browse/MESOS-4855
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Terminate when receiving the ACK of terminal status update.
> 
> 
> Diffs
> -----
> 
>   src/launcher/http_command_executor.cpp 0b4136c040ec9cf585c0d38f8471495a855cd640 
> 
> Diff: https://reviews.apache.org/r/46187/diff/
> 
> 
> Testing
> -------
> 
> make check
> 
> 
> Thanks,
> 
> Qian Zhang
> 
>


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message