
GPU on NAF

We provide some GPU resources.

Regular rules for batch jobs also apply to GPU jobs!

Remember that for GPU jobs (including interactive GPU jobs) the same rules apply as for regular batch jobs; in particular, the runtime limit defaults to a 3 h job lifetime unless you set it otherwise using

+RequestRuntime = <seconds> # requested runtime in seconds

in your submit file!
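As a sketch, a complete submit file requesting six hours of runtime could look like this (the executable name is a placeholder, not from this page):

```
# hypothetical submit file; train.sh is a placeholder executable
executable      = train.sh
+RequestRuntime = 21600   # 6 hours, in seconds
queue
```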

See: Submitting Jobs

Interactive GPU resources

For users of atlas, cms, ilc, and belle, we provide shared GPU access on interactive WGS machines. Ask your group admin to add the resource 'nafgpu' to your account:

naf-atlas-gpu01.desy.de
naf-cms-gpu01.desy.de
naf-ilc-gpu01.desy.de
naf-belle-gpu01.desy.de

These machines have a standard WGS installation plus some GPU-related software. (Ask naf-helpdesk if you are missing something.)

Beware: these are shared resources, so use them for development and testing only!

Access to these machines is via ssh or FastX.

The GPU hardware is currently one NVIDIA P100 per server.

Regular GPU resources in the batch pool

Access is restricted to people with the nafgpu resource. Contact your experiment support (who should in turn contact naf-helpdesk) for access to this resource.

Once you have the needed resource, add:

[ ... ]
Request_GPUs = 1
[ ... ]

to your job submit file.

GPU resources are scarce, and since usage is exclusive, they are precious. Try to make efficient use of your allocated compute time on the GPUs.

The GPU hardware currently in use is one NVIDIA GeForce GTX 1080 Ti per server.

List GPU batch nodes and state

[flemming@pal53]~% condor_status -constraint 'GPUs >= 1'
Name                      OpSys  Arch    State     Activity LoadAv Mem    ActvtyTime

batchg001.desy.de         LINUX  X86_64  Claimed   Busy     1.280  46758  1+20:27:59
batchg002.desy.de         LINUX  X86_64  Claimed   Busy     0.980  46758  0+03:57:50
batchg003.desy.de         LINUX  X86_64  Claimed   Busy     0.980  46758  0+02:19:51
batchg004.desy.de         LINUX  X86_64  Claimed   Busy     0.980  46758  0+02:15:10
batchg005.desy.de         LINUX  X86_64  Claimed   Busy     0.980  46758  0+03:41:59
batchg006.desy.de         LINUX  X86_64  Claimed   Busy     0.980  46758  0+03:52:32
batchg007.desy.de         LINUX  X86_64  Claimed   Busy     0.980  46778  0+02:23:47
batchg008.desy.de         LINUX  X86_64  Unclaimed Idle     0.000  46778  1+21:44:58
batchg009.desy.de         LINUX  X86_64  Claimed   Busy     0.980  46778  0+00:18:21
slot1_1@batchg010.desy.de LINUX  X86_64  Claimed   Busy     1.010  1536   0+01:08:15
slot1_2@batchg010.desy.de LINUX  X86_64  Claimed   Busy     1.010  1536   0+02:44:29
slot1_3@batchg010.desy.de LINUX  X86_64  Claimed   Busy     1.010  1536   0+03:33:18
slot1_4@batchg010.desy.de LINUX  X86_64  Claimed   Busy     0.000  50176  1+19:19:38
batchg011.desy.de         LINUX  X86_64  Unclaimed Idle     0.000  385437 1+23:44:52
batchg012.desy.de         LINUX  X86_64  Claimed   Busy     0.000  385437 2+01:10:08
batchg013.desy.de         LINUX  X86_64  Claimed   Busy     1.000  385437 2+01:08:57

               Total Owner Claimed Unclaimed Matched Preempting Backfill Drain
X86_64/LINUX      16     0      14         2       0          0        0     0
Total             16     0      14         2       0          0        0     0

Anaconda on the NAF

Access to the Anaconda repositories is blocked, as Anaconda does not consider DESY an academic institution and use might be subject to license fees. For details, see this page from the Maxwell documentation: https://docs.desy.de/maxwell/documentation/licensing/conda_terms/

As a replacement, you can use mamba: either try the pre-installed version via module load mamba/3.10 or install your own mamba version.
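As a sketch (the module version follows the text above; environment and package names are placeholders, and availability may differ on your node), creating a private environment could look like:

```
# load the pre-installed mamba module (version as advertised above)
module load mamba/3.10
# create and activate an environment; name and packages are placeholders
mamba create -n myenv python=3.11 numpy
mamba activate myenv
```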

List Capabilities and Software versions of BIRD GPU nodes

[root@bird-htc-sched21 ~]# condor_status -constraint 'gpus >= 1' -af:h Name GPUs_Capability GPUs_DeviceName GPUs_DriverVersion GPUs_GlobalMemoryMb
Name                      GPUs_Capability       GPUs_DeviceName            GPUs_DriverVersion    GPUs_GlobalMemoryMb
slot1@batchg001.desy.de   6.1                   NVIDIA GeForce GTX 1080 Ti 12.6                  11165
slot1_1@batchg002.desy.de 6.1                   NVIDIA GeForce GTX 1080 Ti 12.6                  11165
slot1@batchg004.desy.de   6.1                   NVIDIA GeForce GTX 1080 Ti 12.6                  11165
slot1@batchg007.desy.de   6.1                   NVIDIA GeForce GTX 1080 Ti 12.6                  11165
slot1_1@batchg008.desy.de 6.1                   NVIDIA GeForce GTX 1080 Ti 12.6                  11165
slot1@batchg009.desy.de   6.1                   NVIDIA GeForce GTX 1080 Ti 12.6                  11165
slot1_1@batchg010.desy.de 7.0                   Tesla V100-SXM2-32GB       12.6                  32494
slot1_2@batchg010.desy.de 7.0                   Tesla V100-SXM2-32GB       12.6                  32494
slot1_3@batchg010.desy.de 7.0                   Tesla V100-SXM2-32GB       12.6                  32494
slot1_4@batchg010.desy.de 7.0                   Tesla V100-SXM2-32GB       12.6                  32494
slot1_1@batchg011.desy.de 7.0                   Tesla V100-PCIE-32GB       12.6                  32494
slot1_1@batchg012.desy.de 7.0                   Tesla V100-PCIE-32GB       12.6                  32494
slot1_1@batchg013.desy.de 7.0                   Tesla V100-PCIE-32GB       12.6                  32494

Request certain GPU capabilities or types

Any of the ClassAd attributes above can be required inside your job submit file using the common Requirements syntax.

For example, to request a Tesla V100 GPU:

Requirements = (GPUs_DeviceName == "Tesla V100-PCIE-32GB")

Multiple requirements can be combined using '&&'.
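For example, a submit-file fragment combining the attributes listed above (the memory threshold is just an illustrative value):

```
Request_GPUs = 1
Requirements = (GPUs_DeviceName == "Tesla V100-PCIE-32GB") && (GPUs_GlobalMemoryMb > 20000)
```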

List GPU jobs

[chbeyer@htc-it02]~/htcondor/testjobs% condor_q -constraint 'RequestGPUs >= 1'
-- Schedd: bird-htc-sched02.desy.de : <131.169.56.95:9618?... @ 11/07/18 10:36:14
OWNER   BATCH_NAME     SUBMITTED   DONE   RUN    IDLE  TOTAL JOB_IDS
chbeyer ID: 9902561  11/7  09:44      _      1      8     11 9902561.2-10
chbeyer ID: 9903331  11/7  10:31      _      _     11     11 9903331.0-10
chbeyer ID: 9903332  11/7  10:31      _      _     11     11 9903332.0-10

GPU-top

[root@batchg001 ~]# nvidia-smi
Wed Nov  7 10:38:21 2018       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.72       Driver Version: 410.72       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:17:00.0 Off |                  N/A |
|100%   90C    P2    79W / 250W |  10077MiB / 11178MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0    279258      C   .../htcondor_exec/gpu-burn-master/gpu_burn 10067MiB |
+-----------------------------------------------------------------------------+

Examples

Get an interactive session on a GPU node

The shortest possible way:

condor_submit -i -append "RequestGPUs = 1"

Alternatively, using a minimal submit file:
[chbeyer@batchg002]~/htcondor/testjobs% cat gpu_interactive.submit 
Request_GPUs = 1
queue
[chbeyer@htc-it02]~/htcondor/testjobs% condor_submit -i gpu_interactive.submit
Submitting job(s).
1 job(s) submitted to cluster 4777582.
Waiting for job to start...
Welcome to batchg002.desy.de!
[chbeyer@batchg002]~/htcondor/testjobs% nvidia-smi 
Thu Feb 21 09:15:12 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.79       Driver Version: 410.79       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:17:00.0 Off |                  N/A |
| 47%   48C    P5    15W / 250W |      0MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Get an interactive session on a node with specific properties

You can mark any listed ClassAd attribute of a node as mandatory for your job (see "List Capabilities and Software versions of BIRD GPU nodes"), for example:

[chbeyer@batchg002]~/htcondor/testjobs% cat gpu_interactive.submit 
Requirements = OpSysAndVer == "CentOS7" && (CUDAGlobalMemoryMb > 10000) && (CUDARuntimeVersion == 8.0)
Request_GPUs = 1
queue
[chbeyer@htc-it02]~/htcondor/testjobs% condor_submit -i gpu_interactive.submit
Submitting job(s).
1 job(s) submitted to cluster 4777641.
Waiting for job to start...
Welcome to batchg002.desy.de!

Regular batch job using GPU resources on the NAF

[chbeyer@batchg002]~/htcondor/testjobs% cat sleep.submit
# Unix submit description file
# sleep.submit -- simple sleep job using GPU resources

executable              = /afs/desy.de/user/c/chbeyer/htcondor_exec/sleep_runtime.sh
output                  = /afs/desy.de/user/c/chbeyer/out_$(Cluster)_$(Process).txt
error                   = /afs/desy.de/user/c/chbeyer/error_$(Cluster)_$(Process).txt
Requirements = OpSysAndVer == "CentOS7"
Request_GPUs = 1

# uncomment this if you want to use the job-specific variables $CLUSTER and $PROCESS inside your batch job
#environment = "CLUSTER=$(Cluster) PROCESS=$(Process)"

# uncomment this to request a runtime different from the default 3 hours (time in seconds)
#+RequestRuntime = 6000

# uncomment this to specify an argument given to the executable
#Args = 20

# uncomment this to give this batchjob an individual name-tag to find it easily in the queue 
#batch_name = sleep_test_2

queue 1