hadoop-yarn-dev mailing list archives
From "Keqiu Hu (JIRA)" <j...@apache.org>
Subject [jira] [Created] (YARN-9294) Potential race condition in setting GPU cgroups & execute command in the selected cgroup
Date Mon, 11 Feb 2019 01:12:00 GMT
Keqiu Hu created YARN-9294:
------------------------------

             Summary: Potential race condition in setting GPU cgroups & execute command in the selected cgroup
                 Key: YARN-9294
                 URL: https://issues.apache.org/jira/browse/YARN-9294
             Project: Hadoop YARN
          Issue Type: Bug
          Components: yarn
    Affects Versions: 2.10.0
            Reporter: Keqiu Hu
            Assignee: Keqiu Hu


Environment is latest branch-2 head

OS: RHEL 7.4

*Observation*
Out of roughly 10 container allocations with a GPU requirement, at least 1 allocated container loses GPU isolation. Even when I ask for only 1 GPU, running nvidia-smi inside the container still shows all GPUs on the machine.

The odd part is that even though the process has visibility into all GPUs (say ordinals 0,1,2,3) at the moment container-executor runs, cgroups restricts its access to just the single allocated GPU some time later.

The underlying process trying to access the GPU takes that initial enumeration as the source of truth and tries to use physical GPU 0, which is not actually available to it. This results in a [CUDA_ERROR_INVALID_DEVICE: invalid device ordinal] error.

Validated that the container-executor commands themselves are correct:

{code:java}
PrivilegedOperationExecutor command: [/export/apps/hadoop/nodemanager/latest/bin/container-executor,
--module-gpu, --container_id, container_e22_1549663278916_0249_01_000001, --excluded_gpus,
0,1,2,3]

PrivilegedOperationExecutor command: 
[/export/apps/hadoop/nodemanager/latest/bin/container-executor, khu, khu, 0, application_1549663278916_0249,
/grid/a/tmp/yarn/nmPrivate/container_e22_1549663278916_0249_01_000001.tokens, /grid/a/tmp/yarn,
/grid/a/tmp/userlogs, /export/apps/jdk/JDK-1_8_0_172/jre/bin/java, -classpath, ..., -Xmx256m,
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer, khu,
application_1549663278916_0249, container_e22_1549663278916_0249_01_000001, ltx1-hcl7552.grid.linkedin.com,
8040, /grid/a/tmp/yarn]
{code}

So this is most likely a race condition between these two operations?
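If that hypothesis is right, the failure mode can be sketched as a simple ordering problem: the container process enumerates devices once at startup and trusts that snapshot, so everything depends on whether the cgroup is applied before or after the launch. A minimal illustrative sketch (class and method names are made up for illustration, not YARN internals; the real mechanism is the cgroups devices controller, simulated here with a plain set):

{code:java}
import java.util.HashSet;
import java.util.Set;

// Illustrative model of the suspected race: whether GPU isolation lands
// before or after the process snapshots its visible devices decides what
// the process believes it owns.
public class GpuCgroupRace {

    // All GPUs on the host are visible before any isolation is applied.
    private static Set<Integer> visibleGpus() {
        return new HashSet<>(Set.of(0, 1, 2, 3));
    }

    // Simulates container-executor --module-gpu --excluded_gpus ...:
    // the cgroup denies access to the excluded devices.
    private static void applyCgroup(Set<Integer> gpus, Set<Integer> excluded) {
        gpus.removeAll(excluded);
    }

    // The launched process enumerates devices once (as CUDA does at init)
    // and keeps trusting that snapshot afterwards.
    private static Set<Integer> snapshot(Set<Integer> gpus) {
        return new HashSet<>(gpus);
    }

    // Racey order: launch (and snapshot) first, isolate afterwards.
    // The process keeps a stale view of all four GPUs.
    public static Set<Integer> raceySnapshot() {
        Set<Integer> gpus = visibleGpus();
        Set<Integer> seen = snapshot(gpus);  // process sees 0,1,2,3
        applyCgroup(gpus, Set.of(0, 1, 2));  // isolation lands too late
        return seen;
    }

    // Correct order: isolate first, then launch. The process only ever
    // sees the GPU it was actually allocated.
    public static Set<Integer> correctSnapshot() {
        Set<Integer> gpus = visibleGpus();
        applyCgroup(gpus, Set.of(0, 1, 2));
        return snapshot(gpus);
    }

    public static void main(String[] args) {
        System.out.println("racey view:   " + raceySnapshot());
        System.out.println("correct view: " + correctSnapshot());
    }
}
{code}

In the racey ordering the stale snapshot is exactly what produces the invalid-ordinal CUDA error described above: the process tries physical GPU 0 from its snapshot even though the cgroup has since denied it.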

cc [~jhung]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

