
How to run 'non-batch' workloads on the NAF

For quite some time we have been experiencing unusually high load on many of the submit nodes across the different VOs, usually caused by people doing heavy compute work on these nodes.

While we see the occasional need for heavy interactive computation, since it is not always feasible to come up with a complete Condor-submittable setup for urgent or one-off tasks, we would like to remind you that running these tasks on the login nodes is not the solution.

The high load on these WGS leads to all kinds of problems for you and for the other tenants who depend on running their submit frameworks on the very same hosts.

The grief then piles up in our RT queue, and there is not a lot we can do about it other than remind everyone that these nodes are multi-tenant by design and need to be treated carefully.

The intended way to run high-load interactive work on the NAF is to start an interactive job, which automatically provides SSH login to a worker node with the resources you requested reserved for the job.

For example, the following provides SSH login on a worker node with 60 GB of memory and 8 cores for an 8 h runtime:

[chbeyer@pal94]~% condor_submit -i -append request_cpus=8 -append request_memory=60GB -append request_runtime=28800
Submitting job(s).
1 job(s) submitted to cluster 2816959.
Waiting for job to start...
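If you request the same resources regularly, the `-append` flags can also be kept in a small submit description file (a sketch; the filename is made up, and `request_runtime` is the NAF-specific attribute used above):

```
# interactive.sub -- hypothetical filename
# Same resource request as the -append flags above
request_cpus    = 8
request_memory  = 60GB
request_runtime = 28800
queue
```

which you would then submit interactively with `condor_submit -i interactive.sub`.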

Check the reserved resources:

[chbeyer@pal94]~% condor_q 2817035 -af:l Requestmemory Requestruntime Requestcpus
Requestmemory = 61440
Requestruntime = 28800
Requestcpus = 8

As you can see, I now have 8 h / 8 cores / 60 GB reserved on a worker node. This resource is not only guaranteed for my use, it also will not bother anybody else :)
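Note the units in the condor_q output above: memory is reported in MiB and the runtime is given in seconds. A quick sanity check with plain shell arithmetic (nothing NAF-specific):

```shell
# 60 GB expressed in MiB, matching Requestmemory above
mem_mib=$((60 * 1024))
echo "$mem_mib"     # 61440

# 8 hours expressed in seconds, matching Requestruntime above
runtime_s=$((8 * 3600))
echo "$runtime_s"   # 28800
```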

Keep in mind that the memory boundaries of the machines cannot be crossed (usually 256 GB is such a boundary).

Also, the larger the slot you request, the more sparse availability becomes, and it might take a while before the job starts.
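Before requesting a very large slot you can get a feel for what could fit it. One way (a sketch, assuming the standard slot ClassAd attributes `Memory`, in MiB, and `Cpus`) is to query condor_status from a submit node with a matching constraint:

```shell
# 60 GB in MiB, matching the interactive request above
mem_mib=$((60 * 1024))

# List slots that could accommodate the request; the command is only
# attempted where condor_status exists (i.e. on a NAF submit node)
if command -v condor_status >/dev/null 2>&1; then
    condor_status -constraint "Memory >= $mem_mib && Cpus >= 8" -af Machine Memory Cpus
fi
```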

If you want to start a Jupyter Notebook inside this large slot please look here: