1600562 : condor running degraded
Created: 2026-05-13T05:34:39Z - current status: new
Anonymized Summary: A user reports that the NAF Condor cluster has been experiencing degraded performance since approximately 18:00 the previous day, with only ~3,000 healthy nodes remaining. The issue coincides with job submissions from a specific experiment group ([EXPERIMENT_GROUP]).
Core Issue:
- Cluster Degradation: Significant reduction in healthy worker nodes, potentially linked to job submissions from [EXPERIMENT_GROUP].
- Possible Causes:
  - Short-Running Jobs: Jobs with runtimes <5 minutes may waste resources due to Condor’s negotiation overhead (see Job run times and job bugs).
  - Faulty Jobs: Unstable jobs crashing immediately could exploit the "fast lane" mechanism, leading to rapid job turnover and resource exhaustion (see Job run times and job bugs).
  - Memory Leaks: Jobs exceeding memory limits or leaking memory may trigger Condor to kill jobs, reducing available slots (see Job Requirements).
  - EL9 Migration Issues: Temporary workarounds during the EL9 migration (e.g., memory monitoring inaccuracies) may contribute to instability (see Migration to EL9).
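The short-running-jobs hypothesis can be checked roughly as follows. This is a minimal sketch, assuming finished-job records are available as dictionaries carrying the HTCondor job attributes `Owner` and `RemoteWallClockTime` (for example, exported from `condor_history`). The function name `short_job_share`, the 300-second threshold (taken from the 5-minute guideline above), and the sample records are illustrative assumptions, not part of the ticket.

```python
from collections import Counter

# Threshold taken from the 5-minute guideline in the ticket (assumption:
# wall-clock time is the relevant measure).
SHORT_JOB_SECONDS = 5 * 60


def short_job_share(jobs, threshold=SHORT_JOB_SECONDS):
    """Return {owner: (short_count, total_count)} for a list of job records."""
    total = Counter()
    short = Counter()
    for job in jobs:
        owner = job.get("Owner", "unknown")
        total[owner] += 1
        # Jobs without a recorded wall-clock time are counted as short.
        if float(job.get("RemoteWallClockTime", 0)) < threshold:
            short[owner] += 1
    return {owner: (short[owner], total[owner]) for owner in total}


# Hypothetical records, for illustration only.
sample_jobs = [
    {"Owner": "user_a", "RemoteWallClockTime": 42.0},
    {"Owner": "user_a", "RemoteWallClockTime": 12.0},
    {"Owner": "user_a", "RemoteWallClockTime": 3600.0},
    {"Owner": "user_b", "RemoteWallClockTime": 7200.0},
]

for owner, (n_short, n_total) in short_job_share(sample_jobs).items():
    print(f"{owner}: {n_short} of {n_total} jobs ran under {SHORT_JOB_SECONDS} s")
```

An owner with a high share of sub-threshold jobs is a candidate for the short-running or immediately-crashing job patterns listed above; it does not by itself show which of the two applies.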
Suggested Next Steps:
- Check Job Stability:
  - Verify if [EXPERIMENT_GROUP] jobs are stable (e.g., no crashes or memory leaks).
  - Use `condor_q -held [USERNAME]` to identify held jobs and their failure reasons (e.g., memory limits, runtime).
- Review Job Runtime:
  - Ensure jobs run for >5 minutes to avoid negotiation overhead.
  - For jobs failing quickly, test them interactively before large-scale submission.
- Monitor Cluster Status:
  - Check real-time resource availability: NAF Stats.
  - Look for patterns in job failures (e.g., specific users, job classes).
- EL9-Specific Actions:
  - If jobs target EL9 nodes, confirm compatibility with known issues (e.g., `materialize_max_idle` not working).
  - For memory-intensive jobs, account for temporary performance degradation due to kernel page flushing.
- Contact Support:
  - If the issue persists, escalate to NAF admins with:
    - Job IDs of problematic submissions.
    - Output of `condor_q -global` for queue visualization.
    - Any error logs from held jobs.
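
The held-job check and the search for failure patterns can be combined in a short script. The following is a sketch under stated assumptions: it relies on `condor_q` accepting `-json` and `-attributes` together with `-held`, and on held jobs carrying the `HoldReason` attribute. The helper names `fetch_held_json` and `count_hold_reasons` and the sample data are hypothetical, and the hold-reason strings are made up for illustration, not real NAF output.

```python
import json
import subprocess
from collections import Counter


def fetch_held_json(user):
    """Return raw JSON text describing one user's held jobs.

    Assumes condor_q is on PATH and supports -json and -attributes.
    """
    cmd = ["condor_q", "-held", "-json",
           "-attributes", "ClusterId,ProcId,HoldReason", user]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout


def count_hold_reasons(json_text):
    """Count held jobs per HoldReason string."""
    # Empty output (no matching jobs) is treated as an empty list.
    ads = json.loads(json_text) if json_text.strip() else []
    return Counter(ad.get("HoldReason", "unknown") for ad in ads)


# Hypothetical sample in the assumed JSON shape; the reasons are invented.
sample_output = """
[
  {"ClusterId": 101, "ProcId": 0, "HoldReason": "example: memory limit exceeded"},
  {"ClusterId": 101, "ProcId": 1, "HoldReason": "example: memory limit exceeded"},
  {"ClusterId": 102, "ProcId": 0, "HoldReason": "example: runtime limit exceeded"}
]
"""

for reason, count in count_hold_reasons(sample_output).most_common():
    print(f"{count:4d}  {reason}")
```

On a submit node, `count_hold_reasons(fetch_held_json("[USERNAME]"))` would produce the same kind of tally from live data; the resulting counts, together with the job IDs, are the sort of summary that is useful to attach when escalating to the NAF admins.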