1589627 : condor is almost dead

Created: 2026-04-04T01:11:29Z - current status: new

Anonymized Summary: A user reports significant performance degradation in the Condor job scheduling system over the past 1–2 days. Key issues include:

  • Jobs that typically complete in 1–2 hours now exceed the 3-hour runtime limit and are placed on hold.
  • The number of available healthy slots has dropped to ~50% of normal capacity.
  • A specific user ([USERNAME]) is currently using 2.5k cores from the [EXPERIMENT_QUEUE] queue, which may be contributing to system strain.


Possible Causes & Solutions:

  1. Scheduler Overload (Most Likely)
    The Condor scheduler may be overwhelmed due to:
      • A high volume of faulty jobs (e.g., typos in executable/log paths, full filesystem errors).
      • Excessive condor_q polling (e.g., from automated frameworks or watch commands).
    Action:
      • Check for held jobs with condor_q -hold to identify misconfigured submissions.
      • Reduce polling frequency if using automated tools.
      • Verify log file paths and filesystem quotas.

  2. Resource Contention
    The 2.5k-core user may be monopolizing resources, leaving fewer slots for others.
    Action:
      • Contact the user to optimize job runtime (e.g., merge short jobs, fix memory leaks).
      • Monitor EL9 migration status (if applicable) via bird.desy.de/stats/day.html.

  3. Known EL9 Issues
    If the cluster is undergoing EL9 migration, temporary instability may occur (e.g., false memory readings, scheduler changes).
    Action:
      • Use condor_q -global to track jobs across schedulers.
      • Check for memory consumption errors (kernel cache misreporting).

  4. Job Runtime Efficiency
    Short jobs (<5 minutes) waste resources due to Condor negotiation overhead.
    Action:
      • Encourage users to batch jobs or extend runtime to >5 minutes.
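
The batching advice for short jobs could be sketched as follows. This is a minimal illustration, not a site-provided tool: the function name batch_tasks and the 30-minute default target are assumptions, while the 3-hour limit is the runtime limit mentioned in the ticket. It greedily groups estimated per-task runtimes so that each submitted job runs well above the 5-minute threshold without exceeding the limit.

```python
from typing import List

MAX_BATCH_SECONDS = 3 * 60 * 60   # 3-hour runtime limit mentioned in the ticket


def batch_tasks(durations: List[float],
                target: float = 30 * 60,            # assumed target, not a site policy
                limit: float = MAX_BATCH_SECONDS) -> List[List[float]]:
    """Greedily group estimated task runtimes (in seconds) into batches.

    A batch is closed as soon as its total reaches `target`, or earlier if
    adding the next task would push it past `limit`. A single task longer
    than `limit` still ends up in a batch of its own.
    """
    batches: List[List[float]] = []
    current: List[float] = []
    total = 0.0
    for d in durations:
        # Close the running batch first if this task would break the limit.
        if current and total + d > limit:
            batches.append(current)
            current, total = [], 0.0
        current.append(d)
        total += d
        # Close the batch once it is long enough to amortize the overhead.
        if total >= target:
            batches.append(current)
            current, total = [], 0.0
    if current:
        # The final batch may be shorter than the target.
        batches.append(current)
    return batches
```

For example, batch_tasks([90.0] * 100) groups one hundred 90-second tasks into jobs of 20 tasks each, so the scheduler negotiates a handful of 30-minute jobs instead of 100 very short ones. The runtime estimates are the caller's responsibility; if they are too optimistic, a batch can still hit the 3-hour limit.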

Next Steps:

  • Immediate:
    • Investigate held jobs (condor_q -hold) and scheduler logs for errors.
    • Check bird.desy.de/stats/day.html for resource availability.
  • Long-Term:
    • Notify the 2.5k-core user to review job stability and runtime.
    • Monitor for EL9-related bugs (e.g., materialize_max_idle issues).
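
To make the held-job investigation quicker, the hold reasons can be grouped per user. The sketch below assumes the held jobs have been exported as JSON, for example with something like condor_q -allusers -constraint 'JobStatus == 5' -json; that exact invocation is an assumption and should be checked against the local HTCondor version. Owner and HoldReason are standard job ClassAd attributes, while the function name summarize_holds and the sample data are made up for illustration.

```python
import json
from collections import Counter
from typing import Dict, List, Tuple


def summarize_holds(condor_json: str) -> List[Tuple[Tuple[str, str], int]]:
    """Count held jobs per (Owner, HoldReason), most frequent first."""
    text = condor_json.strip()
    # Treat empty input as "no held jobs" instead of failing in json.loads.
    jobs: List[Dict] = json.loads(text) if text else []
    counts = Counter(
        (job.get("Owner", "?"), job.get("HoldReason", "unknown"))
        for job in jobs
    )
    return counts.most_common()


# Made-up sample data; real hold reasons will be worded differently.
SAMPLE = """[
  {"ClusterId": 101, "ProcId": 0, "Owner": "alice", "HoldReason": "runtime limit exceeded (illustrative)"},
  {"ClusterId": 101, "ProcId": 1, "Owner": "alice", "HoldReason": "runtime limit exceeded (illustrative)"},
  {"ClusterId": 202, "ProcId": 0, "Owner": "bob", "HoldReason": "cannot open log file (illustrative)"}
]"""

for (owner, reason), n in summarize_holds(SAMPLE):
    print(f"{n:4d}  {owner:10s}  {reason}")
```

A large count for a single owner and reason points to a misconfigured submission, whereas the same reason spread across many owners suggests a cluster-side problem such as a full filesystem.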

Sources:

  1. Condor_q Errors and Explanations
  2. Job Runtimes and Efficiency
  3. EL9 Migration Status