mesos-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Kevin Klues <klue...@gmail.com>
Subject Re: Review Request 44366: Added GPUs as an explicit resource.
Date Mon, 14 Mar 2016 07:39:53 GMT


> On March 9, 2016, 1:14 a.m., Ben Mahler wrote:
> > Could you update the description to reflect the new state of the code?

Done.


> On March 9, 2016, 1:14 a.m., Ben Mahler wrote:
> > src/slave/containerizer/containerizer.cpp, line 96
> > <https://reviews.apache.org/r/44366/diff/1/?file=1280284#file1280284line96>
> >
> >     Can we also prevent fractional GPUs here?
> >     
> >     We'd also need to do fractional validation in the launch task path, inside `update`
(as we discussed `prepare` is calling update so no need to stick it in both functions).

I've added code to do the check here.  We will need to remember to add it into the `prepare`
function once we fill that function in.


> On March 9, 2016, 1:14 a.m., Ben Mahler wrote:
> > src/slave/containerizer/containerizer.cpp, lines 97-102
> > <https://reviews.apache.org/r/44366/diff/1/?file=1280284#file1280284line97>
> >
> >     Since it's unlikely we'll have a default number of GPUs in the future, perhaps
we just have a comment here that describes why we don't use a default?

Removed and comment added.


> On March 9, 2016, 1:14 a.m., Ben Mahler wrote:
> > src/slave/containerizer/containerizer.cpp, line 114
> > <https://reviews.apache.org/r/44366/diff/1/?file=1280284#file1280284line114>
> >
> >     Needs to be in the ifdef, right?

Done


> On March 9, 2016, 1:14 a.m., Ben Mahler wrote:
> > src/slave/containerizer/containerizer.cpp, lines 114-135
> > <https://reviews.apache.org/r/44366/diff/1/?file=1280284#file1280284line114>
> >
> >     Can you remove the periods at the end of the error messages?

Done


> On March 9, 2016, 1:14 a.m., Ben Mahler wrote:
> > src/slave/containerizer/containerizer.cpp, lines 115-118
> > <https://reviews.apache.org/r/44366/diff/1/?file=1280284#file1280284line115>
> >
> >     Our pattern for error messages is to avoid the periods and multi-sentence messages.
We'll try to write messages that can compose as follows:
> >     
> >     Failed to <action>: <reason>
> >     
> >     Where reason can also correspond to a composed message, so you can end up with
things like:
> >     
> >     Failed to launch container: Failed to prepare container: Fractional 'gpus' not
supported.
> >     
> >     So when we've needed multi-sentence messages, we've tended to use semi-colons.

These errors have been reduced to single sentences now, so this is no longer an issue. Good
to keep in mind though.


> On March 9, 2016, 1:14 a.m., Ben Mahler wrote:
> > src/slave/containerizer/containerizer.cpp, line 118
> > <https://reviews.apache.org/r/44366/diff/1/?file=1280284#file1280284line118>
> >
> >     These should be plural, so 'gpus'

Fixed.


> On March 9, 2016, 1:14 a.m., Ben Mahler wrote:
> > src/slave/containerizer/containerizer.cpp, lines 121-130
> > <https://reviews.apache.org/r/44366/diff/1/?file=1280284#file1280284line121>
> >
> >     We can get rid of the parsing code here based on the suggestion in the previous
review to add to common/parse.hpp.
> >     
> >     Note that we would usually use const& in the loop here.

Its gone now. The check below it still remains, but now it just checks the size of the array
from the flag directly (since the flag is a `vector<unsigned int>`)


> On March 9, 2016, 1:14 a.m., Ben Mahler wrote:
> > src/v1/resources.cpp, line 1238
> > <https://reviews.apache.org/r/44366/diff/1/?file=1280285#file1280285line1238>
> >
> >     Don't feel obligated, but would be nice to do a blanket s/`.get().`/`->`/
cleanup in the resources / values files, separate from this patch.

I'll add it as a pre-patch to this patch set. That way the new GPU resource that's being added
will never do it the "wrong" way.

Turns out there is nothing to be done in the values files, just the resources files. Also,
doing a quick grep, it seems like we still do things the "wrong" way all over the place. 
Might be nice to do a pass of the full source tree and kill all birds with one stone at some
point.


- Kevin


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/44366/#review122638
-----------------------------------------------------------


On March 14, 2016, 7:39 a.m., Kevin Klues wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/44366/
> -----------------------------------------------------------
> 
> (Updated March 14, 2016, 7:39 a.m.)
> 
> 
> Review request for mesos, Ben Mahler, Rob Todd, and Vikrama Ditya.
> 
> 
> Bugs: MESOS-4865
>     https://issues.apache.org/jira/browse/MESOS-4865
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Currently, we enforce that the number of GPUs specified in the 'gpus'
> resource parameter equal the number of GPUs passed in via the
> --nvidia_gpu_devices flag. In the future, we will generalize this via
> autodiscovery of GPUs and support for GPU types other than Nvidia.
> 
> 
> Diffs
> -----
> 
>   include/mesos/resources.hpp bb343ad852576a75615a93ef850b413bf77698e0 
>   include/mesos/v1/resources.hpp 719110fbbf39f1755460ac0b32e3893656054a4e 
>   src/common/resources.cpp 1f23dc83f7330c305a836d698f114b7eaf3d7ba1 
>   src/slave/containerizer/containerizer.cpp f6fc7863d0c215611f170dc0c89aa229407b5137

>   src/v1/resources.cpp c6f125ec317e2da537a6456f5cff2da0a48701d8 
> 
> Diff: https://reviews.apache.org/r/44366/diff/
> 
> 
> Testing
> -------
> 
> ./bin/mesos-slave.sh --master=127.0.0.1:5050 --resources="gpus:string"
> Failed to determine slave resources: Bad type for resource gpus value string type TEXT
> 
> ./bin/mesos-slave.sh --master=127.0.0.1:5050 --resources="gpus:4.9"
> Failed to determine slave resources: The `gpus` resource must specified as an unsigned
integer
>    
> ./bin/mesos-slave.sh --master=127.0.0.1:5050 --resources="gpus:4.0"
> Failed to determine slave resources: When specifying the `gpus` resource, you must also
specify a list of GPUs via the `--nvidia_gpu_devices` flag
> 
> ./bin/mesos-slave.sh --master=127.0.0.1:5050 --resources="gpus:4.0" --nvidia_gpu_devices=1,2,3
> Failed to determine slave resources: The number of GPUs passed in the `--nvidia_gpu_devices`
flag must match the number of GPUs specified in the `gpus` resource
> 
> ./bin/mesos-slave.sh --master=127.0.0.1:5050 --resources="gpus:4.0" --nvidia_gpu_devices=1,2,3,4
> SUCCESS
> 
> **NOTE**: I didn't set the `--isolation` flag here, so the agent actually started up
correctly (i.e. it didn't error out saying that the Nvidia GPU isolator is currently unsupported).
 This is the correct behaviour, and it properly exercises the code added in this patch.
> 
> 
> Thanks,
> 
> Kevin Klues
> 
>


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message