1600562 : condor running degraded

Created: 2026-05-13T05:34:39Z - current status: new

Anonymized Summary: A user reports that the NAF Condor cluster has been experiencing degraded performance since approximately 18:00 the previous day, with only ~3,000 healthy nodes remaining. The issue coincides with job submissions from a specific experiment group ([EXPERIMENT_GROUP]).

Core Issue:

Cluster Degradation: Significant reduction in healthy worker nodes, potentially linked to job submissions from [EXPERIMENT_GROUP].
Possible Causes:
Short-Running Jobs: Jobs with runtimes <5 minutes may waste resources due to Condor’s negotiation overhead (see Job run times and job bugs).
Faulty Jobs: Unstable jobs crashing immediately could exploit the "fast lane" mechanism, leading to rapid job turnover and resource exhaustion (see Job run times and job bugs).
Memory Leaks: Jobs that leak memory or exceed their requested memory limit may be killed by Condor, reducing available slots (see Job Requirements).
EL9 Migration Issues: Temporary workarounds during the EL9 migration (e.g., memory monitoring inaccuracies) may contribute to instability (see Migration to EL9).
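
To make the short-job cause concrete, the rough sketch below estimates what share of a slot's occupied time goes to real work when every job start carries a fixed scheduling cost. The 60-second overhead is a purely hypothetical figure chosen for illustration, not a measured NAF value, and the name useful_fraction is introduced here.

```python
# Hypothetical fixed cost per job start (matchmaking, transfer, setup), in seconds.
# This number is an assumption for illustration only.
OVERHEAD_S = 60


def useful_fraction(runtime_s: float, overhead_s: float = OVERHEAD_S) -> float:
    """Return the share of slot time spent on payload work under a fixed-overhead model."""
    if runtime_s < 0 or overhead_s < 0:
        raise ValueError("runtime and overhead must be non-negative")
    total = runtime_s + overhead_s
    return runtime_s / total if total else 0.0


if __name__ == "__main__":
    # Compare a 1-minute, a 5-minute, and a 1-hour job under the same overhead.
    for runtime_s in (60, 300, 3600):
        print(f"{runtime_s:>5d} s job: {useful_fraction(runtime_s):.0%} useful")
```

Under this simplified model the wasted share shrinks quickly as runtime grows, which is consistent with the guideline of keeping jobs above 5 minutes.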

Suggested Next Steps:

Check Job Stability:
Verify if [EXPERIMENT_GROUP] jobs are stable (e.g., no crashes or memory leaks).
Use condor_q -held [USERNAME] to identify held jobs and their failure reasons (e.g., memory limits, runtime).
Review Job Runtime:
Ensure jobs run for >5 minutes to avoid negotiation overhead.
For jobs failing quickly, test them interactively before large-scale submission.
Monitor Cluster Status:
Check real-time resource availability on the NAF Stats page.
Look for patterns in job failures (e.g., specific users, job classes).
EL9-Specific Actions:
If jobs target EL9 nodes, confirm compatibility with known issues (e.g., materialize_max_idle not working).
For memory-intensive jobs, account for temporary performance degradation due to kernel page flushing.
Contact Support:
If the issue persists, escalate to NAF admins with:
- Job IDs of problematic submissions.
- Output of condor_q -global, which lists jobs across all queues in the pool.
- Any error logs from held jobs.
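
The stability and runtime checks above can be partly automated. The following is a minimal sketch, assuming the job ClassAds have been exported as a JSON list (for example with the -json option of condor_q or condor_history) and that the standard attributes JobStatus, HoldReason, Owner, ClusterId, ProcId, and RemoteWallClockTime are present. The function name summarize_jobs and the sample records are hypothetical.

```python
from collections import Counter

HELD = 5             # HTCondor JobStatus code for a held job
COMPLETED = 4        # HTCondor JobStatus code for a completed job
MIN_RUNTIME_S = 300  # the 5-minute guideline from the steps above


def summarize_jobs(ads):
    """Count hold reasons per owner and list IDs of completed jobs under 5 minutes."""
    hold_reasons = Counter()
    short_jobs = []
    for ad in ads:
        status = ad.get("JobStatus")
        if status == HELD:
            key = (ad.get("Owner", "unknown"), ad.get("HoldReason", "unknown"))
            hold_reasons[key] += 1
        elif status == COMPLETED:
            wall = ad.get("RemoteWallClockTime")
            if wall is not None and wall < MIN_RUNTIME_S:
                short_jobs.append(f"{ad.get('ClusterId')}.{ad.get('ProcId')}")
    return hold_reasons, short_jobs


if __name__ == "__main__":
    # Made-up sample records standing in for exported ClassAds.
    sample = [
        {"ClusterId": 1, "ProcId": 0, "JobStatus": 5, "Owner": "alice",
         "HoldReason": "memory limit exceeded"},
        {"ClusterId": 2, "ProcId": 0, "JobStatus": 4, "RemoteWallClockTime": 42.0},
        {"ClusterId": 3, "ProcId": 1, "JobStatus": 4, "RemoteWallClockTime": 900},
    ]
    reasons, short = summarize_jobs(sample)
    for (owner, reason), count in reasons.most_common():
        print(f"{owner}: {count} held ({reason})")
    print("completed in under 5 minutes:", ", ".join(short) or "none")
```

With real data, the sample list would be replaced by the result of json.load on the exported file. The per-owner tally and the list of short job IDs are the kind of detail worth attaching when escalating to the NAF admins.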
