GPU on NAF¶
We provide some GPU resources.
The regular batch job rules also apply to GPU jobs, including interactive GPU jobs. In particular, the job lifetime is automatically limited to 3 hours unless you set it explicitly with
+RequestRuntime = <seconds> # requested runtime in seconds
in your submit file.
See: Submitting Jobs
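Because the value is given in seconds, a small conversion helps avoid arithmetic slips. A minimal sketch, assuming a 12 hour job (the `hours` value is only an illustration):

```shell
# Convert a wall time in hours into the seconds expected by +RequestRuntime.
hours=12
runtime_seconds=$((hours * 3600))

# Print the line as it would appear in the submit file.
printf '+RequestRuntime = %d\n' "$runtime_seconds"
```

The printed line can be pasted into the submit file as is.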
Interactive GPU resources¶
For users of ATLAS, CMS, ILC, and Belle, we provide shared GPU access on the following interactive WGS. Ask your group admin to add the resource 'nafgpu' to your account.
naf-atlas-gpu01.desy.de
naf-cms-gpu01.desy.de
naf-ilc-gpu01.desy.de
naf-belle-gpu01.desy.de
These machines have a standard WGS installation plus some GPU-related software. Ask naf-helpdesk if you are missing something.
Beware: These are shared resources, so use this for development and testing only!
Access to these machines is via ssh or FastX.
The GPU hardware currently is one NVIDIA P100 per server.
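As a rough sketch of the access path described above, the host name can be derived from the group name and used with ssh. The `group` value is a placeholder, and running `nvidia-smi` after login assumes the tool is part of the GPU-related software on these machines:

```shell
# Pick your experiment: atlas, cms, ilc or belle (placeholder value).
group=atlas
host="naf-${group}-gpu01.desy.de"

# Print the login command rather than running it, so it can be checked first.
echo "ssh ${host}"
# After logging in, 'nvidia-smi' shows the current state of the shared GPU.
```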
Regular GPU resources in the batch pool¶
Access is restricted to people with the nafgpu resource. Contact your experiment support, who should in turn contact naf-helpdesk, to get access to this resource.
Once you have the needed resource, add:
[ ... ] Request_GPUs = 1 [ ... ]
to your job submit file
GPU resources are scarce, and since usage is exclusive, they are precious. Make efficient use of the compute time allocated to you on the GPUs.
Most batch GPU nodes carry one NVIDIA GeForce GTX 1080 Ti per server; the capability listing further down this page also shows nodes with Tesla V100 cards.
List GPU batch nodes and their state¶
[flemming@pal53]~% condor_status -constraint 'GPUs >= 1'
Name OpSys Arch State Activity LoadAv Mem ActvtyTime
batchg001.desy.de LINUX X86_64 Claimed Busy 1.280 46758 1+20:27:59
batchg002.desy.de LINUX X86_64 Claimed Busy 0.980 46758 0+03:57:50
batchg003.desy.de LINUX X86_64 Claimed Busy 0.980 46758 0+02:19:51
batchg004.desy.de LINUX X86_64 Claimed Busy 0.980 46758 0+02:15:10
batchg005.desy.de LINUX X86_64 Claimed Busy 0.980 46758 0+03:41:59
batchg006.desy.de LINUX X86_64 Claimed Busy 0.980 46758 0+03:52:32
batchg007.desy.de LINUX X86_64 Claimed Busy 0.980 46778 0+02:23:47
batchg008.desy.de LINUX X86_64 Unclaimed Idle 0.000 46778 1+21:44:58
batchg009.desy.de LINUX X86_64 Claimed Busy 0.980 46778 0+00:18:21
slot1_1@batchg010.desy.de LINUX X86_64 Claimed Busy 1.010 1536 0+01:08:15
slot1_2@batchg010.desy.de LINUX X86_64 Claimed Busy 1.010 1536 0+02:44:29
slot1_3@batchg010.desy.de LINUX X86_64 Claimed Busy 1.010 1536 0+03:33:18
slot1_4@batchg010.desy.de LINUX X86_64 Claimed Busy 0.000 50176 1+19:19:38
batchg011.desy.de LINUX X86_64 Unclaimed Idle 0.000 385437 1+23:44:52
batchg012.desy.de LINUX X86_64 Claimed Busy 0.000 385437 2+01:10:08
batchg013.desy.de LINUX X86_64 Claimed Busy 1.000 385437 2+01:08:57
Total Owner Claimed Unclaimed Matched Preempting Backfill Drain
X86_64/LINUX 16 0 14 2 0 0 0 0
Total 16 0 14 2 0 0 0 0
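To see at a glance how many GPU slots are free, the table can be filtered on its State column. The helper below is a sketch: `count_unclaimed` is a hypothetical name, and the sample rows are copied from the listing above so the snippet runs without a batch system.

```shell
# Count unclaimed slots in the default condor_status table read from stdin.
# Data rows are recognised by the OpSys column, so the summary header
# ("Total Owner Claimed Unclaimed ...") is not counted by mistake.
count_unclaimed() {
    awk '$2 == "LINUX" && $4 == "Unclaimed" { n++ } END { print n + 0 }'
}

# Three rows copied from the listing above, used here as stand-in input.
sample='batchg007.desy.de LINUX X86_64 Claimed Busy 0.980 46778 0+02:23:47
batchg008.desy.de LINUX X86_64 Unclaimed Idle 0.000 46778 1+21:44:58
batchg011.desy.de LINUX X86_64 Unclaimed Idle 0.000 385437 1+23:44:52'

free=$(printf '%s\n' "$sample" | count_unclaimed)
echo "unclaimed GPU slots: $free"
```

On a submit host the same function would be fed from the live pool: `condor_status -constraint 'GPUs >= 1' | count_unclaimed`.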
Anaconda on the NAF¶
Access to the Anaconda repositories is blocked, because Anaconda does not consider DESY an academic institution and use might be subject to license fees. For details, see the Maxwell documentation: https://docs.desy.de/maxwell/documentation/licensing/conda_terms/
As a replacement, you can use mamba: either try the pre-installed version via module load mamba/3.10 or install your own mamba version.
List Capabilities and Software versions of BIRD GPU nodes¶
[root@bird-htc-sched21 ~]# condor_status -constraint 'gpus >= 1' -af:h Name GPUs_Capability GPUs_DeviceName GPUs_DriverVersion GPUs_GlobalMemoryMb GPUs_DriverVersion
Name GPUs_Capability GPUs_DeviceName GPUs_DriverVersion GPUs_GlobalMemoryMb GPUs_DriverVersion
slot1@batchg001.desy.de 6.1 NVIDIA GeForce GTX 1080 Ti 12.6 11165 12.6
slot1_1@batchg002.desy.de 6.1 NVIDIA GeForce GTX 1080 Ti 12.6 11165 12.6
slot1@batchg004.desy.de 6.1 NVIDIA GeForce GTX 1080 Ti 12.6 11165 12.6
slot1@batchg007.desy.de 6.1 NVIDIA GeForce GTX 1080 Ti 12.6 11165 12.6
slot1_1@batchg008.desy.de 6.1 NVIDIA GeForce GTX 1080 Ti 12.6 11165 12.6
slot1@batchg009.desy.de 6.1 NVIDIA GeForce GTX 1080 Ti 12.6 11165 12.6
slot1_1@batchg010.desy.de 7.0 Tesla V100-SXM2-32GB 12.6 32494 12.6
slot1_2@batchg010.desy.de 7.0 Tesla V100-SXM2-32GB 12.6 32494 12.6
slot1_3@batchg010.desy.de 7.0 Tesla V100-SXM2-32GB 12.6 32494 12.6
slot1_4@batchg010.desy.de 7.0 Tesla V100-SXM2-32GB 12.6 32494 12.6
slot1_1@batchg011.desy.de 7.0 Tesla V100-PCIE-32GB 12.6 32494 12.6
slot1_1@batchg012.desy.de 7.0 Tesla V100-PCIE-32GB 12.6 32494 12.6
slot1_1@batchg013.desy.de 7.0 Tesla V100-PCIE-32GB 12.6 32494 12.6
Request certain GPU capabilities or types¶
Any of the ClassAd attributes listed above can be requested inside your job submit file using the usual Requirements syntax:
e.g. request a Tesla V100 GPU:
Requirements = (GPUs_DeviceName == "Tesla V100-PCIE-32GB")
Multiple requirements can be listed using '&&'.
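For example, two attributes from the listing above can be combined to ask for any V100-class card regardless of its exact model name. This is a sketch: the thresholds and the file name `gpu_requirements.submit` are illustrative assumptions, and it presumes that `GPUs_Capability` and `GPUs_GlobalMemoryMb` can be compared numerically, as their values in the listing suggest.

```shell
# Write a small submit file that combines two requirements with '&&'.
# The quoted 'EOF' keeps the shell from expanding anything inside the file.
cat > gpu_requirements.submit <<'EOF'
Request_GPUs = 1
Requirements = (GPUs_Capability >= 7.0) && (GPUs_GlobalMemoryMb >= 30000)
queue
EOF

# Show the result for review before submitting it.
cat gpu_requirements.submit
```

Such a file has no executable and is meant for an interactive session, in the same way as the `gpu_interactive.submit` examples on this page.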
List GPU jobs¶
[chbeyer@htc-it02]~/htcondor/testjobs% condor_q -constraint 'RequestGPUs >= 1'
-- Schedd: bird-htc-sched02.desy.de : <131.169.56.95:9618?... @ 11/07/18 10:36:14
OWNER BATCH_NAME SUBMITTED DONE RUN IDLE TOTAL JOB_IDS
chbeyer ID: 9902561 11/7 09:44 _ 1 8 11 9902561.2-10
chbeyer ID: 9903331 11/7 10:31 _ _ 11 11 9903331.0-10
chbeyer ID: 9903332 11/7 10:31 _ _ 11 11 9903332.0-10
GPU-top¶
[root@batchg001 ~]# nvidia-smi
Wed Nov 7 10:38:21 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.72 Driver Version: 410.72 CUDA Version: 10.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 108... Off | 00000000:17:00.0 Off | N/A |
|100% 90C P2 79W / 250W | 10077MiB / 11178MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 279258 C .../htcondor_exec/gpu-burn-master/gpu_burn 10067MiB |
+-----------------------------------------------------------------------------+
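For scripted monitoring, `nvidia-smi` can also emit selected fields as CSV through its `--query-gpu` and `--format` options. The sketch below parses one such line; the sample values are taken from the output above, and the helper name `mem_percent` is hypothetical.

```shell
# On a GPU node the line would come from:
#   nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total \
#              --format=csv,noheader,nounits
# Sample line (utilisation %, used MiB, total MiB) matching the output above.
line='100, 10077, 11178'

# Print the used share of GPU memory as a whole percentage (truncated).
mem_percent() {
    awk -F', *' '{ printf "%d\n", 100 * $2 / $3 }'
}

pct=$(printf '%s\n' "$line" | mem_percent)
echo "GPU memory in use: ${pct}%"
```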
Examples¶
Get an interactive session on GPU node¶
The shortest possible way:
condor_submit -i -append "RequestGPUs = 1"
[chbeyer@batchg002]~/htcondor/testjobs% cat gpu_interactive.submit
Request_GPUs = 1
queue
[chbeyer@htc-it02]~/htcondor/testjobs% condor_submit -i gpu_interactive.submit
Submitting job(s).
1 job(s) submitted to cluster 4777582.
Waiting for job to start...
Welcome to batchg002.desy.de!
[chbeyer@batchg002]~/htcondor/testjobs% nvidia-smi
Thu Feb 21 09:15:12 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.79 Driver Version: 410.79 CUDA Version: 10.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 108... Off | 00000000:17:00.0 Off | N/A |
| 47% 48C P5 15W / 250W | 0MiB / 11178MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
Get an interactive session on a node with specific properties¶
You can make any listed ClassAd attribute of a node mandatory for your job (see "List Capabilities and Software versions of BIRD GPU nodes"), for example:
[chbeyer@batchg002]~/htcondor/testjobs% cat gpu_interactive.submit
Requirements = OpSysAndVer == "CentOS7" && (CUDAGlobalMemoryMb > 10000) && (CUDARuntimeVersion == 8.0)
Request_GPUs = 1
queue
[chbeyer@htc-it02]~/htcondor/testjobs% condor_submit -i gpu_interactive.submit
Submitting job(s).
1 job(s) submitted to cluster 4777641.
Waiting for job to start...
Welcome to batchg002.desy.de!
Regular batch job using GPU resources on the NAF¶
[chbeyer@batchg002]~/htcondor/testjobs% cat sleep.submit
# Unix submit description file
# sleep.sub -- simple sleep job using GPU resources
executable = /afs/desy.de/user/c/chbeyer/htcondor_exec/sleep_runtime.sh
output = /afs/desy.de/user/c/chbeyer/out_$(Cluster)_$(Process).txt
error = /afs/desy.de/user/c/chbeyer/error_$(Cluster)_$(Process).txt
Requirements = OpSysAndVer == "CentOS7"
Request_GPUs = 1
# uncomment this if you want to use the job specific variables $CLUSTER and $PROCESS inside your batchjob
#environment = "CLUSTER=$(Cluster) PROCESS=$(Process)"
# uncomment this to request a runtime other than the 3 hour default (time in seconds)
#+RequestRuntime = 6000
# uncomment this to specify an argument given to the executable
#Args = 20
# uncomment this to give this batchjob an individual name-tag to find it easily in the queue
#batch_name = sleep_test_2
queue 1
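The executable itself is not shown above. As a hypothetical stand-in (not the author's `sleep_runtime.sh`), a job script might first report which GPU it was given. The sketch assumes the batch system exports `CUDA_VISIBLE_DEVICES` to GPU jobs, as HTCondor normally does; the function name `report_gpus` is made up for illustration.

```shell
#!/bin/bash
# Hypothetical GPU job payload (assumption, not part of the original page).

# Report which GPU(s) the batch system assigned to this job.
report_gpus() {
    echo "assigned GPU(s): ${CUDA_VISIBLE_DEVICES:-none}"
}

echo "running on $(uname -n)"
report_gpus

# Show the device state if the NVIDIA tools are present on this node.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi || echo "nvidia-smi could not query the device"
fi
```

The lines end up in the file named by `output` in the submit file, which makes it easy to confirm afterwards that the job really ran on a GPU slot.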