1584760 : Re: Condor jobs not running
Created: 2026-03-16T13:18:24Z - current status: new
Summary of the Issue

A user reports that multiple HTCondor jobs in the BIRD_cms.lite queue (and some in BIRD_cms.bide) have remained idle for extended periods (since ~12:30–13:30 on 12 March 2026). The affected job IDs include:

- 1997843
- 2001915–2001919 (new batch)
- 1961226, 1961471–1961506 (earlier batch)

The user is running jobs on naf-cms11 and emphasizes urgency due to an upcoming CMS review deadline.
Possible Causes & Solutions
1. Check Job Status and Hold Reasons

- Action: Verify whether jobs are idle (waiting for resources) or held (due to errors).

```bash
condor_q -held [USERNAME]                                             # Check held jobs
condor_q [JOB_ID] -af HoldReason RequestMemory MemoryUsage JobStatus  # Detailed status
```

- Expected Outcome:
  - If jobs are held, the `HoldReason` will indicate the issue (e.g., memory limits, runtime limits, or path errors).
  - If jobs are idle, proceed to check quotas/priorities or scheduler issues.
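For the whole batch of affected IDs, the queries above can be wrapped in a small helper. This is an illustrative sketch, not an NAF tool: the function name and the hard-coded job list are taken from the ticket, and it must be run on a login node where the HTCondor CLI is on `PATH`.

```bash
# check_hold_reasons: hypothetical helper that loops over the affected
# job IDs from the ticket and prints status and hold reason for each.
check_hold_reasons() {
  for job in 1997843 2001915 2001916 2001917 2001918 2001919; do
    printf '== job %s ==\n' "$job"
    # JobStatus 1 = idle, 5 = held; HoldReason is undefined for idle jobs
    condor_q "$job" -af JobStatus HoldReason
  done
}
```

Calling `check_hold_reasons` once gives a quick overview; any job printing status 5 with a hold reason needs its requirements or paths fixed before release.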
2. Quota/Priority Constraints

- Context: The BIRD_cms.lite queue has a quota-based fairshare system (see Quotas and Priorities). If the user's weighted usage is high, new jobs may be deprioritized.
- Action: Check the user's priority and usage:

```bash
condor_userprio.desy   # For batch jobs
condor_userprio.gpu    # For GPU jobs (if applicable)
```

- Solution:
  - If the user's effective priority is low (high numerical value), jobs may wait longer. Wait for usage to decay (7-day rolling window).
  - If the queue is over-subscribed, consider:
    - Splitting jobs into smaller batches.
    - Using BIRD_cms.bide (longer runtime) if jobs exceed 3-hour limits.
3. Scheduler Overload

- Context: The HTCondor scheduler (bird-htc-sched21.desy.de) may be overloaded due to:
  - Faulty job submissions (e.g., invalid paths, full log directories).
  - High-frequency polling (e.g., `watch condor_q` or automated scripts).
- Action:
  - Reduce polling frequency (avoid `watch` or rapid `condor_q` calls).
  - Check for scheduler errors:

```bash
condor_q -global   # See all jobs across schedulers
```

- Solution: If the scheduler is unresponsive, wait for NAF admins to resolve the overload (running jobs are unaffected).
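As a gentler alternative to `watch condor_q`, the queue can be polled on a long interval. This is a sketch; the function name and the 5-minute interval are arbitrary choices, not NAF policy:

```bash
# poll_queue: re-run condor_q at most once every 5 minutes, instead of
# watch's default 2-second refresh, to keep load off the scheduler.
poll_queue() {
  while true; do
    condor_q
    sleep 300   # 5 minutes between polls
  done
}
```

Stop the loop with Ctrl-C once the jobs start; leaving it running overnight is still one query every five minutes, which the scheduler tolerates far better than `watch`.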
4. Resource Availability

- Context: The EL9 migration (completed July 2024) may cause temporary resource constraints (see Migration to EL9).
- Action: Check available cores/OS:

```bash
# View NAF resource stats (EL8 vs. EL9)
firefox https://bird.desy.de/stats/day.html
```

- Solution: If EL9 resources are limited, submit jobs to EL8-compatible WGS (e.g., naf-cms11 with EL8 singularity images for CMS).
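If jobs must land on a particular OS generation, one generic HTCondor approach is an explicit `Requirements` expression on the machine's `OpSysAndVer` attribute. The exact value advertised is site-dependent, so this fragment is an assumption to verify first (e.g., with `condor_status -af OpSysAndVer | sort -u`):

```
# Hypothetical submit-file fragment: pin jobs to EL9 worker nodes.
# Check the value actually advertised at NAF before relying on it.
Requirements = (OpSysAndVer == "AlmaLinux9")
```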
5. Job Requirements

- Context: Jobs may be idle if they exceed default limits (e.g., 3 h runtime, 1.5 GB memory for `lite` jobs).
- Action:
  - Modify job requirements in the submit file:

```
# Example: Request 4h runtime and 2GB memory
+RequestRuntime = 14400   # 4h in seconds
RequestMemory = 2048      # 2GB in MB
```

  - Release held jobs after editing:

```bash
condor_qedit [JOB_ID] RequestRuntime 14400
condor_release [JOB_ID]
```
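Put together, a minimal submit file with raised limits might look like the sketch below. The executable and file names are placeholders, and `+RequestRuntime` is the NAF-specific runtime attribute used above:

```
# Minimal illustrative submit file (placeholder names and values)
executable      = run_analysis.sh   # placeholder script
log             = job.log
output          = job.out
error           = job.err
+RequestRuntime = 14400             # 4 h; beyond the 3 h lite default
RequestMemory   = 2048              # MB
queue
```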
6. Node-Specific Issues

- Context: Jobs may fail on specific worker nodes (e.g., bird812.desy.de).
- Action: Identify problematic nodes:

```bash
condor_history -constraint 'JobStatus != 4' -af LastRemoteHost
```

- Solution: Exclude problematic nodes in the submit file:

```
Requirements = (Machine =!= "bird812.desy.de")
```
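The `condor_history` query above prints one `slotN@host` line per job; a short pipeline can tally which nodes show up most often. `count_hosts` is an illustrative helper, not an HTCondor tool:

```bash
# count_hosts: read "slotN@host" lines on stdin, strip the slot prefix,
# and print a descending count of jobs per worker node.
count_hosts() {
  sed 's/.*@//' | sort | uniq -c | sort -rn
}

# On an NAF login node, feed it the real history data:
#   condor_history -constraint 'JobStatus != 4' -af LastRemoteHost | count_hosts
```

A node that dominates the count is a candidate for the `Requirements` exclusion shown above.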
Recommended Next Steps

- Check job status (`condor_q -held` and `condor_q [JOB_ID]`).
- Verify quotas/priorities (`condor_userprio.desy`).
- Reduce scheduler load (avoid rapid polling).
- Adjust job requirements (runtime/memory) if needed.
- Monitor resource availability (EL9/EL8 stats).
- Contact NAF admins if issues persist (provide job IDs and error logs).
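The read-only checks in this list can be sketched as one triage helper (the function name is illustrative; run it on an NAF login node with a real job ID):

```bash
# naf_triage: run the read-only diagnostic checks from this ticket in order.
# $1 = one job ID to inspect in detail (e.g., 1997843).
naf_triage() {
  job="${1:?usage: naf_triage JOB_ID}"
  condor_q -held "$USER"                                              # held jobs
  condor_q "$job" -af JobStatus HoldReason RequestMemory MemoryUsage  # job detail
  condor_userprio.desy                                                # fairshare
  condor_q -global                                                    # scheduler view
}
```

None of these commands modify jobs, so the output can be attached verbatim to a ticket for the NAF admins.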