mesos-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Neil Conway <neil.con...@gmail.com>
Subject Re: Review Request 48313: Creation and deletion of persistent volumes across agent restart.
Date Tue, 14 Jun 2016 08:53:01 GMT


> On June 8, 2016, 1:28 p.m., Neil Conway wrote:
> > Overall seems like a reasonable approach.
> > 
> > One thing that isn't clear to me: what is the advantage of updating the checkpoint
to reflect any partial work that was done before exiting? It seems that adds a bunch of complexity
and room for error. Why not only update the checkpoint if all changes were made successfully?
> 
> Anindya Sinha wrote:
>     We would need to maintain what was actually successful in any case since in a DESTROY,
a failed rmdir does not lead to the agent exiting. So, if we were to do it at one place, we
would still need to keep account of the successful operations so as to not update the checkpoint
based on a failed rmdir as an example (and hence can be a partial update).
>     
>     Since we are keeping track of result of the operations anyway, I think it is a good
idea to update before exiting (only place we do that when CREATE fails and the agent exits)
so that the subsequent handling of CheckpointResources does not need to redo such operations
when the agent reregisters.
> 
> Neil Conway wrote:
>     On reflection, I wonder whether we should be handling `CREATE` errors differently
from `DESTROY` errors. In both cases, the user has asked the agent to do something it wasn't
able to do. A failed `DESTROY` has the addditional concern that we might have destroyed some
but not all of the data on the volume.
>     
>     Do you think handling `CREATE` vs. `DESTROY` errors differently is a good idea?
> 
> Anindya Sinha wrote:
>     Good point. Here is what I think are the use cases:
>     Say we have checkpointed resources (CR) as {V1,V2} where V1, V2 are 2 persistent
volumes. So, CR(master) = {V1,V2}, and CR(agent) = {V1,V2}.
>     If we now receive a DESTROY(V2): CR(master) = {V1} [since master's view is not dependent
on what happened on the agent]. Suppose that fails on the agent, so CR(agent) = {V1,V2} [since
we do not update checkpoint resources on agent on failure in DESTROY, which results in inconsistency
between master and agent at this point of time].
>     
>     Case 1 (current implementation): Agent does not restart on failure in DESTROY. Hence,
CR(agent) = {V1,V2}. When the next CheckpointResources is received, ie. on a subsequent CREATE/DESTROY/RESERVE/UNRESERVE
on a different resource, DESTROY(V2) will be reattempted and if that is successful, we will
in sync between agent and master. However if the next CheckpointResources is due to a CREATE(V2)
[that can happen since V2 is available as a resource based on offer from master], that would
be a no-op on agent since agent does not treat it as a new resource based on the checkpoint
since at this point CR(master) = {V1,V2}, and CR(agent) = {V1,V2}, which would be a problem.
>     
>     Case 2 (if we exit agent on failure): The agent restarts which triggers a CheckpointResources
from master->agent on ReregisterSlave. That would force a reattempt of DESTROY(V2) since
current view is CR(master) = {V1} and CR(agent) = {V1,V2} which will reattempt to bring the
checkpointed resources back in sync between master and agent.
>     
>     So, I think it might be a better option to exit the agent on failure in DESTROY as
well. However, I think we should still update the checkpoint based on the status of successful
operations (other CREATE/DESTROY) on failure (when agent exits) so as to avoid these operations
to be repeated in a subsequent CheckpointResources message. Does that sound reasonable to
you?
>     
>     Note: I think this use case probably is a good example to consider StatusUpdates
(or something similar) for operations on reserved resources, viz. CREATE/DESTROY/RESERVE/UNRESERVE
to ensure master and agent are in sync to ensure guaranteed view of offers (to frameworks)
for reserved resources.
> 
> Neil Conway wrote:
>     Thanks for the writeup! Makes sense.
>     
>     If we cause the agent to exit on DESTROY failure, if we only re-apply DESTROYs at
`ReregisterSlave`, it seems to me there is still a window in which another `CREATE` can be
applied at the master. That would mean we wouldn't reapply the `DESTROY`, which would be bad.
>     
>     My concern about updating the checkpointed resources to reflect *partial* results
is that (a) it seems unnecessary, (b) it means the checkpointed resources don't reflect either
the *original* or the *desired* state of the agent, which seems problematic.
>     
>     What about the following: when an agent gets a `CheckpointedResourcesMessage` that
differs from its current state, it first writes the desired resource state to a new checkpoint
file, `checkpoint.target`. Then it tries to apply the delta between `checkpoint.target` and
the current checkpoint. If the on-disk state at the agent is updated successfully to match
`checkpoint.target`, we rename `checkpoint.target` -> `checkpoint` and we're done. Otherwise,
if any I/O operation fails, we exit the agent. Then on restart, we check for the existence
of `checkpoint.target` *before* we try to reregister with the master; if `checkpoint.target`
!= `checkpoint`, we resume trying to apply the delta between `checkpoint.target` and `checkpoint`;
if that fails we exit again, and if it succeeds we rename `checkpoint.target` -> `checkpoint`
and then continue to reregister with the master.
>     
>     Let me know what you think.
> 
> Anindya Sinha wrote:
>     > If we cause the agent to exit on DESTROY failure, if we only re-apply DESTROYs
at ReregisterSlave, it seems to me there is still a window in which another CREATE can be
applied at the master. That would mean we wouldn't reapply the DESTROY, which would be bad.
>     
>     Agreed, but I think we agree that the window of failure is very small. The agent
needs to fail during processing of a `CheckpointResourcesMessage` pertaining to a DELETE,
an offer goes out to a framework with the disk space as available, and a CREATE for the same
volume and the same persistence id is received prior to the successful reregister of the agent.
>     
>     > What about the following: when an agent gets a CheckpointedResourcesMessage
that differs from its current state, it first writes the desired resource state to a new checkpoint
file, checkpoint.target. Then it tries to apply the delta between checkpoint.target and the
current checkpoint. If the on-disk state at the agent is updated successfully to match checkpoint.target,
we rename checkpoint.target -> checkpoint and we're done. Otherwise, if any I/O operation
fails, we exit the agent. Then on restart, we check for the existence of checkpoint.target
before we try to reregister with the master; if checkpoint.target != checkpoint, we resume
trying to apply the delta between checkpoint.target and checkpoint; if that fails we exit
again, and if it succeeds we rename checkpoint.target -> checkpoint and then continue to
reregister with the master.
>     
>     This would work as far as retrying DESTROY on the agent is concerned; however, this
would still continue to offer resources to frameworks that are not available. If DESTROY fails
(continuously even after multiple retries), the master has already offered that resource as
available to the framework which seems incorrect. I think we need a longer term discussion
on something similar to StatusUpdates for reserved resources (agent->master->framework)
so that such a resource is offered to the framework only after the agent successfully processes
it.
>     
>     I will post an update to this RR to address the remaining comments, and modify it
as appropriate as we decide based on the current thread.

> I think we agree that the window of failure is very small.

The window is indeed small, but it is still possible. If we can eliminate the window of failure
entirely without too much complexity, I think that is worth investigating.

> If DESTROY fails (continuously even after multiple retries), the master has already offered
that resource as available to the framework which seems incorrect.

Indeed -- although the master can offer resources that are unavailable in general (e.g., master
offers resource at an agent and concurrently the agent crashes; task launches will return
TASK_LOST). I completely agree that we want something akin to status updates + reconciliation
for reliably making reservations and dynamic volumes -- +1 to the longer-term discussion.


- Neil


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/48313/#review136638
-----------------------------------------------------------


On June 13, 2016, 9:03 p.m., Anindya Sinha wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/48313/
> -----------------------------------------------------------
> 
> (Updated June 13, 2016, 9:03 p.m.)
> 
> 
> Review request for mesos, Neil Conway and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-5448
>     https://issues.apache.org/jira/browse/MESOS-5448
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> o Checkpoints on the agent are updated only after successful handling
>   of persistent volume creation and deletion to maintain consistency.
> o If volume creation or deletion fails, checkpoint is updated up until
>   that point, and the agent exits.
> o This ensures that after a agent restart, checkpoints are in sync
>   between the master and the agent after the reregistration workflow.
> 
> 
> Diffs
> -----
> 
>   include/mesos/resources.hpp a557e97c65194d4aad879fb88d8edefd1c95b8d8 
>   include/mesos/v1/resources.hpp a5ba8fec4c9c3643646308f75a4b28cefe0b3df3 
>   src/common/resources.cpp f6ff92b591c15bc8e93fd85e1896349c3a7bb968 
>   src/slave/slave.cpp 0ccc56f153024ff45d2fd1f25c79c7bb6ed7120e 
>   src/v1/resources.cpp 8c3f2d1c1529915a59d47fe37bb3fc7a3267079a 
> 
> Diff: https://reviews.apache.org/r/48313/diff/
> 
> 
> Testing
> -------
> 
> All tests passed.
> 
> 
> Thanks,
> 
> Anindya Sinha
> 
>


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message