mesos-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Anindya Sinha <anindya_si...@apple.com>
Subject Re: Review Request 48313: Creation and deletion of persistent volumes across agent restart.
Date Mon, 20 Jun 2016 23:42:01 GMT


> On June 20, 2016, 5:16 p.m., Jiang Yan Xu wrote:
> > src/slave/slave.cpp, line 2513
> > <https://reviews.apache.org/r/48313/diff/4/?file=1420538#file1420538line2513>
> >
> >     The first argument is already a member variable and doesn't need to be passed
around right?
> >     
> >     It's unclear from the method name whether the "sync" implies syncing the two
arguments or syncing the disk state and `newCheckpointedResources`. Also it's a bit unintuitive
that the resources info file is checkpointed in the method but member variable `checkpointedResources`
is assigned outside instead.
> >     
> >     Would the following semantics be clearer?
> >     
> >     1. Use this method to do one thing: bring the filesystem in sync with the newCheckpointedResources
and return an Error if anything goes wrong.
> >     
> >     ```
> >     Try<Nothing> syncCheckpointedResources(const Resources& newCheckpointedResources);
> >     ```
> >     
> >     2. On the caller side: if the result is an error, EXIT. Otherwise checkpoint
the resource info file and remove the target file. Note that we need to do similar things
in `recover()` but in there you would fail the future instead of directly EXIT.

We can avoid passing `checkpointedResources` in `syncCheckpointedResources()` when called
from `Slave::checkpointResources(const vector<Resource>& _checkpointedResources)`
but cannot avoid when called from `Future<Nothing> Slave::recover(const Result<state::State>&
state)`. Hence we are passing the current checkpointed info as well (as derived from resources.info
in `Slave::recover()`).

+1 for rest of the comments in terms of scope of functionality of `syncCheckpointedResources()`.


> On June 20, 2016, 5:16 p.m., Jiang Yan Xu wrote:
> > src/slave/slave.cpp, lines 2552-2553
> > <https://reviews.apache.org/r/48313/diff/4/?file=1420538#file1420538line2552>
> >
> >     Can we directly return in case of an error instead of using variables like this?

There are 3 error conditions here so to consolidate return from one place which I believe
is better than return with the same failures from 3 places... hence, the variables. I can
certainly consolidate `isFailed` and `additionalReason` into `Option<string> additionalReason
= None()` to capture any errors (and not need `isFailed` in this case).

I just realized this RR has 1 condition only, but when the changes for RR 48315 is added,
there are 3 conditions. Let me know if you want me to remove `additionalReason` from this
RR and add it in RR 48315. Again, all the RRs 48313 through to 48315 go together.


> On June 20, 2016, 5:16 p.m., Jiang Yan Xu wrote:
> > src/slave/slave.cpp, lines 2520-2521
> > <https://reviews.apache.org/r/48313/diff/4/?file=1420538#file1420538line2520>
> >
> >     If this fails, we don't know how it has failed and if the agent crashes next
and restarts, it's possible for the agent to recover some non-empty resources from the target
and erroneously starts deleting directories. Possible?

is it possible? `os::rm()` for files calls `unlink()` which clears the inode entry for that
file (assuming no other links to that file), although the data may still exist on disk (until
another write operation overwrites the actual bits). And isn't clearing the inode entry an
atomic operation? So I do not think there is a case where the data is partial.


> On June 20, 2016, 5:16 p.m., Jiang Yan Xu wrote:
> > src/slave/slave.cpp, lines 2583-2591
> > <https://reviews.apache.org/r/48313/diff/4/?file=1420538#file1420538line2583>
> >
> >     This comment is supposed to be revised together with /r/48315/?

Since RRs 48313 through to 48315 go together, do  you want to change this comment in this
RR?


> On June 20, 2016, 5:16 p.m., Jiang Yan Xu wrote:
> > src/slave/slave.cpp, line 4742
> > <https://reviews.apache.org/r/48313/diff/4/?file=1420538#file1420538line4742>
> >
> >     We need to differentiate between the target being empty (everthing is unreserved)
and there's no target (changes are processed and committed) right?

Yes, good catch. If `resources.target` is absent, then no sync is needed (which is captured
if `resourcesState.get().target.isSome()` is `false`).
If `resources.target` is present and `targetResources != resources`, we need to sync. So,
checking `!targetResources.empty()` is not necessary since it might mean that all resources
need to be unreserved if `resources.empty()` is `false`.


- Anindya


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/48313/#review138314
-----------------------------------------------------------


On June 20, 2016, 11:41 p.m., Anindya Sinha wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/48313/
> -----------------------------------------------------------
> 
> (Updated June 20, 2016, 11:41 p.m.)
> 
> 
> Review request for mesos, Neil Conway and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-5448
>     https://issues.apache.org/jira/browse/MESOS-5448
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> When the agent receives CheckpointedResourcesMessage, we store the
> target checkpoint on disk. On successful create and destroy of
> persistent volumes as a part of handling this messages, we commit
> the checkpoint on the disk, and clear the target checkpoint.
> 
> However, incase of any failure we do not commit the checkpoint to
> disk, and exit the agent. When the agent restarts and there is a
> target checkpoint present on disk which differs from the committed
> checkpoint, we retry to sync the target and committed checkpoint.
> On success, we reregister the agent with the master, but in case it
> fails, we do not commit the checkpoint and the agent exists.
> 
> 
> Diffs
> -----
> 
>   src/slave/paths.hpp 339e539863c678b6ed4d4670d75c7ff4c54daa79 
>   src/slave/paths.cpp 03157f93b1e703006f95ef6d0a30afae375dcdb5 
>   src/slave/slave.hpp 58ff2bfac6918d989ab36b67cf6ba2f3657c8356 
>   src/slave/slave.cpp 4bf01f2b020f5e975fb57cffcd19865d7431eac2 
>   src/slave/state.hpp 0de2a4ee4fabaad612c4526166157b001c380bdb 
>   src/slave/state.cpp 04c3d42040f186507a0e484a3ee616f1b1a77ea8 
> 
> Diff: https://reviews.apache.org/r/48313/diff/
> 
> 
> Testing
> -------
> 
> All tests passed.
> 
> 
> Thanks,
> 
> Anindya Sinha
> 
>


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message