1584024 : Condor jobs not running
Created: 2026-03-12T14:10:58Z - current status: new
Summary of Issue
A user reports that their HTCondor jobs in the BIRD_cms.lite and BIRD_cms.bide queues have been idle for several hours (since approximately 12:30 and 13:30). The affected jobs are [JOB_ID_1] and [JOB_ID_RANGE]. The user is working on [WORKER_NODE].
Possible Causes
- Scheduler Overload
  The HTCondor scheduler (`bird-htc-sched21.desy.de`) may be overloaded due to:
  - A high volume of faulty job submissions (e.g., incorrect executable/logging paths, full filesystem issues).
  - Excessive polling of `condor_q` (e.g., via automated scripts or `watch` commands). (Source: condor_q errors and explanations)
- Quota or Priority Limits
  The BIRD_cms.lite and BIRD_cms.bide queues may have reached their group quotas or user priority thresholds, delaying job scheduling.
  - The BIRD_cms.lite quota is 7741.94, with 824 jobs currently running.
  - The BIRD_cms.bide quota is 3870.97, with 880 jobs currently running. (Source: Quotas and priorities)
- Resource Contention
  If other users in the same group are submitting large job batches, their jobs compete for the same group quota, and fair-share policies (based on recent usage) may rank them ahead of this user's jobs.
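To put the quota figures above in proportion, the ratio of running jobs to quota can be computed directly. The following is a minimal sketch: the helper name `utilization` is made up for illustration, and it assumes that the quota and the running-job count are expressed in the same unit, which the report does not confirm (quotas may be counted in cores or weighted slots rather than jobs).

```bash
#!/usr/bin/env bash
# Sketch only: compare running-job counts with the group quotas quoted above.
# Assumption: both numbers are in the same unit (not confirmed by the report).

utilization() {
  # $1 = running jobs, $2 = quota; prints the ratio as a percentage (one decimal).
  # LC_ALL=C keeps "." as the decimal separator regardless of the user's locale.
  LC_ALL=C awk -v r="$1" -v q="$2" 'BEGIN { printf "%.1f\n", 100 * r / q }'
}

echo "BIRD_cms.lite: $(utilization 824 7741.94)%"
echo "BIRD_cms.bide: $(utilization 880 3870.97)%"
```

Under that assumption the script reports 10.6% for BIRD_cms.lite and 22.7% for BIRD_cms.bide, which would point away from quota exhaustion as the sole cause and towards priority or scheduler load.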
Suggested Solutions
- Check Job Status and Hold Reasons
  Run the following commands to diagnose why jobs are idle:
  ```bash
  # Print selected attributes of one job (JobStatus: 1 = idle, 2 = running, 5 = held)
  condor_q [JOB_ID] -af HoldReason RequestMemory MemoryUsage JobStatus
  condor_q -held  # List all held jobs
  ```
  - If jobs are held, correct the issue (e.g., adjust memory/runtime requirements) and release them:
    ```bash
    condor_release [JOB_ID]
    ```
- Verify Queue Priorities
  Check the user/group priority for the queues:
  ```bash
  condor_userprio.desy  # For batch jobs
  condor_userprio.gpu   # For GPU jobs (if applicable)
  ```
  - If the effective priority is poor (in HTCondor, a numerically larger value means a worse priority), jobs may be delayed until higher-priority jobs complete.
- Reduce Scheduler Load
  - Avoid frequent `condor_q` polling (e.g., remove `watch` commands or automated scripts).
  - Ensure no faulty jobs are being resubmitted in bulk.
- Contact Support
  If jobs remain idle for more than 24 hours or if the scheduler is unresponsive, escalate to the NAF support team with:
  - Job IDs.
  - Queue names (`BIRD_cms.lite`, `BIRD_cms.bide`).
  - Output of `condor_q -better-analyze [JOB_ID]`.
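The items requested for a support ticket can be gathered in a single pass. The sketch below is one possible way to do it, not an official NAF tool: the function name `collect_support_report`, the report layout, and the output file are assumptions made for illustration; only `condor_q -better-analyze` and the queue names come from the text above.

```bash
#!/usr/bin/env bash
# Sketch only: write the details needed for a support ticket into one file.
# Hypothetical helper; the name and the report layout are illustrative.

collect_support_report() {
  local outfile="$1"; shift            # first argument: report file to write
  local job
  {
    echo "Queues: BIRD_cms.lite, BIRD_cms.bide"
    echo "Collected: $(date -u +%Y-%m-%dT%H:%M:%SZ)"
    for job in "$@"; do                # remaining arguments: job IDs
      echo "=== Job ${job} ==="
      # One query per job, run once; do not wrap this in a polling loop.
      condor_q -better-analyze "${job}" 2>&1 || echo "condor_q failed for ${job}"
    done
  } > "${outfile}"
}
```

A call such as `collect_support_report idle_jobs_report.txt [JOB_ID]` (with real job IDs in place of the placeholder) issues exactly one `condor_q` query per job, which keeps the extra load on the scheduler small.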