1600562 : condor running degraded

Created: 2026-05-13T05:34:39Z - current status: new

Anonymized Summary: A user reports that the NAF Condor cluster has been experiencing degraded performance since approximately 18:00 the previous day, with only ~3,000 healthy nodes remaining. The issue coincides with job submissions from a specific experiment group ([EXPERIMENT_GROUP]).


Core Issue:

  • Cluster Degradation: Significant reduction in healthy worker nodes, potentially linked to job submissions from [EXPERIMENT_GROUP].
  • Possible Causes:
    • Short-Running Jobs: Jobs with runtimes <5 minutes may waste resources due to Condor’s negotiation overhead (see Job run times and job bugs).
    • Faulty Jobs: Unstable jobs crashing immediately could exploit the "fast lane" mechanism, leading to rapid job turnover and resource exhaustion (see Job run times and job bugs).
    • Memory Leaks: Jobs exceeding memory limits or leaking memory may trigger Condor to kill jobs, reducing available slots (see Job Requirements).
    • EL9 Migration Issues: Temporary workarounds during the EL9 migration (e.g., memory monitoring inaccuracies) may contribute to instability (see Migration to EL9).
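
One way to test the short-running and faulty-job hypotheses is to measure, per owner, what share of recently finished jobs ran for less than five minutes. The sketch below is only an illustration, not an NAF tool. It assumes job records are available as a JSON list (for example exported with condor_history -json) and that each record carries the job ClassAd attributes Owner and RemoteWallClockTime. The function name, the threshold constant, and the sample records are hypothetical.

```python
import json
from collections import defaultdict

# 5 minutes, the runtime threshold named in the list above
SHORT_JOB_SECONDS = 300


def short_job_share(job_ads_json):
    """Return {owner: (short_jobs, total_jobs, share)} from a JSON list of job ads.

    Assumes each ad has "Owner" and "RemoteWallClockTime" (seconds);
    missing values fall back to "unknown" and 0.
    """
    ads = json.loads(job_ads_json) if job_ads_json.strip() else []
    counts = defaultdict(lambda: [0, 0])  # owner -> [short, total]
    for ad in ads:
        owner = ad.get("Owner", "unknown")
        runtime = float(ad.get("RemoteWallClockTime", 0))
        counts[owner][1] += 1
        if runtime < SHORT_JOB_SECONDS:
            counts[owner][0] += 1
    return {o: (s, t, s / t) for o, (s, t) in counts.items()}


# Made-up sample records, for illustration only
sample = json.dumps([
    {"Owner": "user_a", "RemoteWallClockTime": 12},
    {"Owner": "user_a", "RemoteWallClockTime": 40},
    {"Owner": "user_b", "RemoteWallClockTime": 3600},
    {"Owner": "user_b", "RemoteWallClockTime": 120},
])

for owner, (short, total, share) in sorted(short_job_share(sample).items()):
    print(f"{owner}: {short}/{total} jobs under {SHORT_JOB_SECONDS}s ({share:.0%})")
```

An owner whose share is close to 100% over a large number of jobs would be a candidate for the short-running or immediately-crashing pattern described above. The number alone does not distinguish the two cases.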

Suggested Next Steps:

  1. Check Job Stability:
    • Verify if [EXPERIMENT_GROUP] jobs are stable (e.g., no crashes or memory leaks).
    • Use condor_q -held [USERNAME] to identify held jobs and their failure reasons (e.g., memory limits, runtime).

  2. Review Job Runtime:
    • Ensure jobs run for >5 minutes to avoid negotiation overhead.
    • For jobs failing quickly, test them interactively before large-scale submission.

  3. Monitor Cluster Status:
    • Check real-time resource availability: NAF Stats.
    • Look for patterns in job failures (e.g., specific users, job classes).

  4. EL9-Specific Actions:
    • If jobs target EL9 nodes, confirm compatibility with known issues (e.g., materialize_max_idle not working).
    • For memory-intensive jobs, account for temporary performance degradation due to kernel page flushing.

  5. Contact Support:
    • If the issue persists, escalate to NAF admins with:
      • Job IDs of problematic submissions.
      • Output of condor_q -global for queue visualization.
      • Any error logs from held jobs.
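
The held-job check and the details requested for escalation (job IDs and hold reasons) can be gathered in one pass. The following is a minimal sketch under stated assumptions: that condor_q accepts -held together with -json and prints a JSON list of job ads, and that each ad carries ClusterId, ProcId, and HoldReason. The helper names and the sample hold-reason strings are hypothetical placeholders, not real log output. The parsing function works on any JSON text, so it can be tried without cluster access.

```python
import json
import subprocess
from collections import Counter


def fetch_held_jobs_json(user):
    """Run condor_q for one user's held jobs and return the raw JSON text.

    Requires HTCondor on the PATH; not called automatically in this sketch.
    """
    result = subprocess.run(
        ["condor_q", "-held", user, "-json"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def summarize_hold_reasons(jobs_json):
    """Return (job_ids, [(reason, count), ...]) with the most frequent reason first."""
    # Empty output (no held jobs) is treated as an empty list
    ads = json.loads(jobs_json) if jobs_json.strip() else []
    job_ids = []
    reasons = Counter()
    for ad in ads:
        job_ids.append(f"{ad.get('ClusterId')}.{ad.get('ProcId')}")
        reasons[ad.get("HoldReason", "unknown")] += 1
    return job_ids, reasons.most_common()


# Placeholder input for illustration only
sample = json.dumps([
    {"ClusterId": 101, "ProcId": 0, "HoldReason": "example: memory limit exceeded"},
    {"ClusterId": 101, "ProcId": 1, "HoldReason": "example: memory limit exceeded"},
    {"ClusterId": 102, "ProcId": 0, "HoldReason": "example: runtime limit exceeded"},
])

job_ids, reason_counts = summarize_hold_reasons(sample)
print("Held jobs:", ", ".join(job_ids))
for reason, count in reason_counts:
    print(f"{count} x {reason}")
```

On a submit node, the same summary could be produced with summarize_hold_reasons(fetch_held_jobs_json("[USERNAME]")) and the printed lines attached to the escalation message.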

Sources:

  1. Job run times and job bugs
  2. Job Requirements (and failures)
  3. Migration to EL9