flink-issues mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-8431) Allow to specify # GPUs for TaskManager in Mesos
Date Tue, 30 Jan 2018 02:01:00 GMT

    [ https://issues.apache.org/jira/browse/FLINK-8431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16344388#comment-16344388 ]

ASF GitHub Bot commented on FLINK-8431:

Github user eastcirclek commented on the issue:

    As you pointed out, the discussion we had on the mailing list was about the JM not starting
TMs on GPU-equipped agents. It turned out that a Mesos framework needs to specify the `GPU_RESOURCES`
capability if it wants to get resource offers that contain GPUs [[link]](http://mesos.apache.org/documentation/latest/gpu-support/#framework-capabilities).
I managed to start TMs on the GPU-equipped agents by setting the master flag `--filter_gpu_resources`
to false when starting the Mesos master. [MESOS-7576](https://issues.apache.org/jira/browse/MESOS-7576)
introduces `--filter_gpu_resources`; when the flag is set to false, Mesos frameworks that
do not have the `GPU_RESOURCES` capability can receive offers that contain GPUs from the Mesos
master. So that problem could be solved without modifying Flink.
    The reason I created [FLINK-8431](https://issues.apache.org/jira/browse/FLINK-8431) to
allow specifying the number of GPUs is that, when GPUs are isolated, TMs are not going to see
any GPUs unless they request them explicitly, as shown in [link](http://mesos.apache.org/documentation/latest/gpu-support/#agent-flags).
    Regarding your question,
    > Is the original problem which we want to solve that Flink does not use agents which
have GPU resources or that Flink cannot specify the number of GPUs it requires to run? It
looks as if the PR solves the latter ...
    Yes, the scope of FLINK-8431 and this PR is confined to the latter.
    > but I was wondering whether we shouldn't solve the former problem.
    I don't think we need to take care of the former anymore because `GPU_RESOURCES` is going
to be deprecated in favor of the reservation mechanism as shown in [link](https://www.mail-archive.com/dev@mesos.apache.org/msg37571.html)
and [MESOS-7576](https://issues.apache.org/jira/browse/MESOS-7576). Thus, we need not split
servers into two categories (CPU-only servers and GPU-equipped servers) anymore. Nevertheless,
we need to specify `GPU_RESOURCES` until it is completely deprecated in Mesos 2.x. To this
end, I add the `GPU_RESOURCES` capability if the number of GPUs is larger than 0.
    For those whose JM does not get offers that contain GPUs, I suggest restarting the
Mesos master with `--filter_gpu_resources` set to false, as explained above.
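
    To illustrate the rule described above (advertise `GPU_RESOURCES` only when the number of GPUs is larger than 0), here is a minimal, self-contained sketch. It is not the code from the PR: the class `GpuCapabilitySketch`, the `Capability` enum, and both helper methods are hypothetical stand-ins, and the configuration is modelled as a plain `Map`. Only the property name `mesos.resourcemanager.tasks.gpus` (from the issue description) and the capability name come from this thread.

```java
import java.util.EnumSet;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

public class GpuCapabilitySketch {

    /** Hypothetical stand-in for the capabilities a Mesos framework can advertise. */
    enum Capability { GPU_RESOURCES }

    /** Property name proposed in the FLINK-8431 description. */
    static final String GPUS_KEY = "mesos.resourcemanager.tasks.gpus";

    /** Reads the number of GPUs per TaskManager; assumes 0 when the property is unset. */
    static int gpusPerTaskManager(Map<String, String> conf) {
        int gpus = Integer.parseInt(conf.getOrDefault(GPUS_KEY, "0"));
        if (gpus < 0) {
            throw new IllegalArgumentException(GPUS_KEY + " must not be negative: " + gpus);
        }
        return gpus;
    }

    /** Adds GPU_RESOURCES only if at least one GPU is requested. */
    static Set<Capability> capabilitiesFor(int gpus) {
        Set<Capability> capabilities = EnumSet.noneOf(Capability.class);
        if (gpus > 0) {
            capabilities.add(Capability.GPU_RESOURCES);
        }
        return capabilities;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put(GPUS_KEY, "2");
        int gpus = gpusPerTaskManager(conf);
        System.out.println("gpus=" + gpus + " capabilities=" + capabilitiesFor(gpus));
    }
}
```

    With the property unset, the sketch requests no GPUs and advertises no extra capability, so a CPU-only setup behaves as before.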

> Allow to specify # GPUs for TaskManager in Mesos
> ------------------------------------------------
>                 Key: FLINK-8431
>                 URL: https://issues.apache.org/jira/browse/FLINK-8431
>             Project: Flink
>          Issue Type: Improvement
>          Components: Cluster Management, Mesos
>            Reporter: Dongwon Kim
>            Assignee: Dongwon Kim
>            Priority: Minor
> Mesos provides first-class support for Nvidia GPUs [1], but Flink does not exploit it
when scheduling TaskManagers. If Mesos agents are configured to isolate GPUs as shown in [2],
TaskManagers that do not explicitly request GPUs cannot see GPUs at all.
> We therefore need to introduce a new configuration property named "mesos.resourcemanager.tasks.gpus"
to allow users to specify the number of GPUs for each TaskManager process in Mesos.
> [1] http://mesos.apache.org/documentation/latest/gpu-support/
> [2] http://mesos.apache.org/documentation/latest/gpu-support/#agent-flags

This message was sent by Atlassian JIRA
