
A few condor jobs lagging behind

If some of your jobs are slow, or in general do not behave as you expect and as your other jobs do, there may be a general problem in the pool, for example some nodes being under high load.

Here are some strategies for checking whether a so-called black hole worker is involved, which can easily be avoided:

Check which worker nodes the jobs did not succeed on:

Based on the runtime (< 10 sec):

condor_history -constraint 'RemoteWallClockTime < 10' -af LastRemoteHost

Based on the condor job status (4 means completed):

condor_history -constraint 'JobStatus != 4' -af LastRemoteHost

Based on the exit code of your command (you need to know which exit code indicates success):

condor_history -constraint 'ExitCode != <expected exit code>' -af LastRemoteHost
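Any of the queries above prints one host per failed job, so a long list can be hard to read. A minimal sketch for summarising it, assuming that LastRemoteHost has the usual form slot@machine (for example slot1_3@bird812.desy.de); the helper name count_hosts is made up for this example and is not part of condor:

```shell
# Hypothetical helper: count how often each worker node shows up.
# Reads LastRemoteHost values from stdin, one per line.
count_hosts() {
    sed 's/^.*@//' |   # drop the "slot...@" prefix, keep the machine name
    sort |             # group identical machine names together
    uniq -c |          # prefix each machine name with its number of occurrences
    sort -rn           # most frequent machine first
}

# Intended usage (requires access to the condor schedd):
#   condor_history -constraint 'RemoteWallClockTime < 10' -af LastRemoteHost | count_hosts
```

A node that appears far more often than the others is a likely black hole candidate. Jobs that never started have no LastRemoteHost and will typically show up as a single "undefined" entry.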

If you can narrow the problem down to one or two nodes, you can exclude these nodes from your next submission by adding this to your submit file (in this case for bird812):

Requirements = (Machine =!= "bird812.desy.de")
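To exclude more than one node, the individual conditions can be combined with &&. A small sketch that builds such a line from a list of host names; the function name exclude_hosts and the second host bird813 are only placeholders for illustration:

```shell
# Hypothetical helper: print a Requirements line that excludes
# every host name passed as an argument.
exclude_hosts() {
    if [ "$#" -eq 0 ]; then
        echo "usage: exclude_hosts HOST [HOST...]" >&2
        return 1
    fi
    eh_expr=""
    for eh_host in "$@"; do
        eh_clause="(Machine =!= \"$eh_host\")"
        if [ -z "$eh_expr" ]; then
            eh_expr="$eh_clause"                 # first host: no && needed
        else
            eh_expr="$eh_expr && $eh_clause"     # further hosts: chain with &&
        fi
    done
    printf 'Requirements = %s\n' "$eh_expr"
}

# Example with a made-up second host:
exclude_hosts bird812.desy.de bird813.desy.de
```

The printed line can be pasted into the submit file as it is. If your submit file already has a Requirements line, add the new conditions to that existing expression with && rather than adding a second Requirements line.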

Do not keep these exclusion lists forever, though: they narrow your available resources. Please also let us know about the affected nodes, so that we can fix the problem on the worker node.