1584024 : Condor jobs not running

Created: 2026-03-12T14:10:58Z - current status: new


Summary of Issue

A user reports that their HTCondor jobs in the BIRD_cms.lite and BIRD_cms.bide queues have been idle for several hours (since about 12:30 and 13:30). The affected jobs are [JOB_ID_1] and [JOB_ID_RANGE]. The user is working on [WORKER_NODE].

Possible Causes

  1. Scheduler Overload: The HTCondor scheduler (bird-htc-sched21.desy.de) may be overloaded due to:
     - A high volume of faulty job submissions (e.g., incorrect executable or logging paths, a full filesystem).
     - Excessive polling of condor_q (e.g., via automated scripts or watch commands).
     (Source: condor_q errors and explanations)

  2. Quota or Priority Limits: The BIRD_cms.lite and BIRD_cms.bide queues may have reached their group quotas or user priority thresholds, delaying job scheduling.
     - The BIRD_cms.lite quota is 7741.94, with 824 jobs currently running (about 11% of the quota if each job counts as one quota unit).
     - The BIRD_cms.bide quota is 3870.97, with 880 jobs running (about 23% under the same assumption).
     (Source: Quotas and priorities)

  3. Resource Contention: If other users in the same group are submitting large job batches, the group's share of slots may already be occupied. Fair-share policies (based on recent usage) then decide whose jobs start next, and a user with heavy recent usage is ranked behind users who have used less.
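
To put the quota figures above in perspective, the following is a minimal sketch that turns a running-job count and a group quota into a utilization percentage. It assumes that each running job consumes exactly one quota unit, which may not hold if jobs request several cores. The function name quota_utilization is hypothetical and not part of HTCondor.

```shell
# Hypothetical helper (not an HTCondor command): utilization in percent,
# ASSUMING one quota unit per running job.
quota_utilization() {
    # $1 = number of running jobs, $2 = group quota
    awk -v running="$1" -v quota="$2" \
        'BEGIN { printf "%.1f\n", 100 * running / quota }'
}

quota_utilization 824 7741.94    # BIRD_cms.lite, prints 10.6
quota_utilization 880 3870.97    # BIRD_cms.bide, prints 22.7
```

Under that assumption both queues are well below their quotas, so user priority or resource requirements would be the more likely limit. If the jobs request multiple cores each, the real utilization is correspondingly higher.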


Suggested Solutions

  1. Check Job Status and Hold Reasons: Run the following commands to see whether the jobs are idle or held (JobStatus 1 means idle, 5 means held):

         condor_q [JOB_ID] -af HoldReason RequestMemory MemoryUsage JobStatus
         condor_q -held    # List all held jobs

     If jobs are held, correct the issue (e.g., adjust memory or runtime requirements) and release them:

         condor_release [JOB_ID]

  2. Verify Queue Priorities: Check the user/group priority for the queues:

         condor_userprio.desy    # For batch jobs
         condor_userprio.gpu     # For GPU jobs (if applicable)

     If the user's priority is poor (in HTCondor, a numerically larger effective priority value means a worse priority), jobs may be delayed until better-ranked jobs have been served.

  3. Reduce Scheduler Load:
     - Avoid frequent condor_q polling (e.g., remove watch commands or automated scripts).
     - Ensure no faulty jobs are being resubmitted in bulk.

  4. Contact Support: If jobs remain idle for more than 24 hours or if the scheduler is unresponsive, escalate to the NAF support team with:
     - Job IDs.
     - Queue names (BIRD_cms.lite, BIRD_cms.bide).
     - Output of condor_q -better-analyze [JOB_ID].
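
The status check and the support hand-off described above can be combined into a single one-shot script, sketched below. It assumes that a single cluster.proc job ID is passed in, so that condor_q returns exactly one JobStatus value. The function names job_status_name and diagnose_job are hypothetical helpers, not HTCondor commands. The script runs condor_q only once or twice per call, in line with the advice to avoid frequent polling.

```shell
# Translate HTCondor's numeric JobStatus codes into readable names.
job_status_name() {
    case "$1" in
        1) echo "Idle" ;;
        2) echo "Running" ;;
        3) echo "Removed" ;;
        4) echo "Completed" ;;
        5) echo "Held" ;;
        6) echo "Transferring Output" ;;
        7) echo "Suspended" ;;
        *) echo "Unknown" ;;
    esac
}

# Hypothetical helper: diagnose one job without a polling loop.
# ASSUMES $1 is a single cluster.proc ID, so JobStatus is one value.
diagnose_job() {
    job_id="$1"
    status=$(condor_q "$job_id" -af JobStatus)
    echo "Job $job_id is $(job_status_name "$status")"
    if [ "$status" = "5" ]; then
        # Held jobs carry a HoldReason that explains what to fix.
        condor_q "$job_id" -af HoldReason
    else
        # For idle jobs, ask the matchmaking analysis why nothing matches.
        condor_q -better-analyze "$job_id"
    fi
}
```

Calling diagnose_job with one of the affected job IDs prints the decoded status followed by either the hold reason or the matchmaking analysis. The latter is the output the support team asks for.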

References

  1. condor_q errors and explanations
  2. Quotas and priorities