1584760 : Re: Condor jobs not running

Created: 2026-03-16T13:18:24Z - current status: new

Summary of the Issue

A user reports that multiple HTCondor jobs in the BIRD_cms.lite queue (and some in BIRD_cms.bide) have remained in the idle state for an extended period (since roughly 12:30–13:30 on 12 March 2026). The affected job IDs are:

  • 1997843
  • 2001915–2001919 (new batch)
  • 1961226, 1961471–1961506 (earlier batch)

The user is running jobs on naf-cms11 and emphasizes urgency due to an upcoming CMS review deadline.


Possible Causes & Solutions

1. Check Job Status and Hold Reasons

  • Action: Verify whether the jobs are idle (waiting for resources) or held (stopped due to an error):

```bash
condor_q -hold [USERNAME]                                             # Check held jobs
condor_q [JOB_ID] -af HoldReason RequestMemory MemoryUsage JobStatus  # Detailed status
```

  • Expected Outcome:
    • If jobs are held, the HoldReason will indicate the issue (e.g., memory limits, runtime limits, or path errors).
    • If jobs are idle, proceed to check quotas/priorities or scheduler issues (see the match-analysis sketch after this list).
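
For idle (rather than held) jobs, condor_q can also report why a job has not matched any machine. A minimal diagnostic sketch using the standard analysis mode ([JOB_ID] is a placeholder, as above):

```bash
# Ask the scheduler why an idle job is not matching: the report shows how
# many machines were rejected by each clause of the job's Requirements.
condor_q -better-analyze [JOB_ID]
```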

2. Quota/Priority Constraints

  • Context: The BIRD_cms.lite queue has a quota-based fairshare system (see Quotas and Priorities). If the user’s weighted usage is high, new jobs may be deprioritized.
  • Action:
    • Check the user’s priority and usage:

```bash
condor_userprio.desy   # For batch jobs
condor_userprio.gpu    # For GPU jobs (if applicable)
```
  • Solution:
    • If the user’s effective priority is poor (a high numerical value in the output), jobs may wait longer. Wait for recent usage to decay (7-day rolling window).
    • If the queue is over-subscribed, consider:
      • Splitting jobs into smaller batches.
      • Using BIRD_cms.bide (longer runtime) if jobs exceed the 3-hour lite limit (see the submit-file sketch after this list).
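
Since queue placement follows the requested runtime, moving jobs to BIRD_cms.bide amounts to raising +RequestRuntime above the 3-hour lite limit. A minimal submit-file sketch, assuming that routing rule holds (run_analysis.sh is a hypothetical placeholder):

```bash
# Hypothetical submit file: request 6 h of runtime so the job exceeds the
# 3 h lite limit and should be routed to BIRD_cms.bide (routing behavior
# is a site assumption; see "Quotas and Priorities" in Sources).
executable      = run_analysis.sh   # placeholder script name
+RequestRuntime = 21600             # 6 h in seconds
RequestMemory   = 2048              # 2 GB
queue
```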

3. Scheduler Overload

  • Context: The HTCondor scheduler (bird-htc-sched21.desy.de) may be overloaded due to:
    • Faulty job submissions (e.g., invalid paths, full log directories).
    • High-frequency polling (e.g., watch condor_q or automated scripts).
  • Action:
    • Reduce polling frequency (avoid watch or rapid condor_q calls); see the polling sketch after this list.
    • Check for scheduler errors:

```bash
condor_q -global   # See all jobs across schedulers
```
  • Solution:
    • If the scheduler is unresponsive, wait for NAF admins to resolve the overload (running jobs are unaffected).
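
As a scheduler-friendly alternative to watch condor_q (which re-queries every 2 seconds by default), a minimal shell sketch; the 5-minute interval is an arbitrary choice, not a documented NAF requirement:

```bash
# Poll the queue at a gentle interval instead of hammering the scheduler.
while true; do
    condor_q "$USER"
    sleep 300   # 5 minutes between queries; adjust as needed
done
```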

4. Resource Availability

  • Context: The EL9 migration (completed July 2024) may cause temporary resource constraints (see Migration to EL9).
  • Action:
    • Check available cores/OS:

```bash
# View NAF resource stats (EL8 vs. EL9)
firefox https://bird.desy.de/stats/day.html
```
  • Solution:
    • If EL9 resources are limited, submit jobs to EL8-compatible workgroup servers (WGS), e.g., naf-cms11 with EL8 Singularity images for CMS (see the OS-pinning sketch after this list).
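
Jobs can also be pinned to a specific OS generation via the standard OpSysAndVer machine attribute. A minimal sketch; the value strings advertised at NAF (e.g., "AlmaLinux9") are an assumption, so list them first:

```bash
# List the OS versions actually advertised by the pool.
condor_status -af OpSysAndVer | sort -u
```

Then, in the submit file:

```bash
# Pin the job to EL9 nodes ("AlmaLinux9" is an assumed value string;
# use one reported by condor_status above).
requirements = (OpSysAndVer == "AlmaLinux9")
```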

5. Job Requirements

  • Context: Jobs may be idle if they exceed default limits (e.g., 3h runtime, 1.5GB memory for lite jobs).
  • Action:
    • Modify job requirements in the submit file (a complete minimal example follows this list):

```bash
# Example: request 4 h runtime and 2 GB memory
+RequestRuntime = 14400   # 4 h in seconds
RequestMemory   = 2048    # 2 GB
```

    • Release held jobs after editing:

```bash
condor_qedit [JOB_ID] RequestRuntime 14400
condor_release [JOB_ID]
```
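
For reference, a complete minimal submit file with these raised limits; all file names (run_analysis.sh, job.log, etc.) are hypothetical placeholders:

```bash
# Hypothetical minimal submit file with raised runtime/memory limits.
executable      = run_analysis.sh    # placeholder
output          = job_$(Process).out
error           = job_$(Process).err
log             = job.log
+RequestRuntime = 14400              # 4 h in seconds
RequestMemory   = 2048               # 2 GB
queue 1
```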

6. Node-Specific Issues

  • Context: Jobs may fail on specific worker nodes (e.g., bird812.desy.de).
  • Action:
    • Identify problematic nodes (a tallying sketch follows this list):

```bash
condor_history -constraint 'JobStatus != 4' -af LastRemoteHost
```
  • Solution:
    • Exclude problematic nodes in the submit file:

```bash
Requirements = (Machine =!= "bird812.desy.de")
```
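
To check whether a single node dominates the failures, the host names can be tallied. A minimal sketch, assuming the jobs are still within condor_history retention (LastRemoteHost has the form slot@hostname, hence the cut):

```bash
# Count the last execution host of recent non-completed jobs; a node that
# appears far more often than the rest is a candidate for exclusion.
condor_history "$USER" -constraint 'JobStatus != 4' -af LastRemoteHost \
  | cut -d@ -f2 | sort | uniq -c | sort -rn | head
```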

Recommended Next Steps

  1. Check job status (condor_q -hold and condor_q [JOB_ID]).
  2. Verify quotas/priorities (condor_userprio.desy).
  3. Reduce scheduler load (avoid rapid polling).
  4. Adjust job requirements (runtime/memory) if needed.
  5. Monitor resource availability (EL9/EL8 stats).
  6. Contact NAF admins if issues persist (provide job IDs and error logs).

Sources

  1. Job Requirements (and failures)
  2. Quotas and Priorities
  3. Migration to EL9
  4. Condor_q Errors
  5. Lagging Jobs