1589627: condor is almost dead
Created: 2026-04-04T01:11:29Z - current status: new
Anonymized Summary: A user reports significant performance degradation in the Condor job scheduling system over the past 1–2 days. Key issues include:
- Jobs that typically complete in 1–2 hours now exceed the 3-hour runtime limit and are placed on hold.
- The number of available healthy slots has dropped to ~50% of normal capacity.
- A specific user ([USERNAME]) is currently using 2.5k cores from the [EXPERIMENT_QUEUE] queue, which may be contributing to system strain.
Possible Causes & Solutions:
- Scheduler Overload (Most Likely)
  - The Condor scheduler may be overwhelmed due to:
    - A high volume of faulty jobs (e.g., typos in executable/log paths, full filesystem errors).
    - Excessive `condor_q` polling (e.g., from automated frameworks or `watch` commands).
  - Action:
    - Check for held jobs with `condor_q -hold` to identify misconfigured submissions.
    - Reduce polling frequency if using automated tools.
    - Verify log file paths and filesystem quotas.
- Resource Contention
  - The 2.5k-core user may be monopolizing resources, leaving fewer slots for others.
  - Action:
    - Contact the user to optimize job runtime (e.g., merge short jobs, fix memory leaks).
    - Monitor EL9 migration status (if applicable) via bird.desy.de/stats/day.html.
- Known EL9 Issues
  - If the cluster is undergoing EL9 migration, temporary instability may occur (e.g., false memory readings, scheduler changes).
  - Action:
    - Use `condor_q -global` to track jobs across schedulers.
    - Check for memory consumption errors (kernel cache misreporting).
- Job Runtime Efficiency
  - Short jobs (<5 minutes) waste resources due to Condor negotiation overhead.
  - Action:
    - Encourage users to batch jobs or extend runtime to >5 minutes.
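
The batching advice above can be illustrated with a small calculation. This is a minimal sketch, not something taken from the ticket: given a user's own estimate of per-task runtime, it groups tasks so that each submitted job is expected to run for at least a target wall time. The helper name `chunk_tasks`, the file names, and the 20-second estimate are hypothetical; the 300-second default simply mirrors the 5-minute threshold mentioned above.

```python
import math


def chunk_tasks(tasks, seconds_per_task, min_job_seconds=300):
    """Group tasks into jobs that should each run >= min_job_seconds.

    Hypothetical helper: seconds_per_task is the user's own estimate.
    The final chunk may be shorter if the task count does not divide evenly.
    """
    if seconds_per_task <= 0:
        raise ValueError("seconds_per_task must be positive")
    # Number of tasks one job needs in order to reach the target runtime.
    per_job = max(1, math.ceil(min_job_seconds / seconds_per_task))
    return [tasks[i:i + per_job] for i in range(0, len(tasks), per_job)]


if __name__ == "__main__":
    # Illustrative input: 1000 files at an assumed ~20 s each.
    files = [f"input_{n}.dat" for n in range(1000)]
    jobs = chunk_tasks(files, seconds_per_task=20)
    print(len(jobs), "jobs,", len(jobs[0]), "files per job")
```

With these assumed numbers, 1000 short tasks collapse into 67 jobs of about five minutes each, so the scheduler negotiates 67 times instead of 1000.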
Next Steps:
- Immediate:
  - Investigate held jobs (`condor_q -hold`) and scheduler logs for errors.
  - Check bird.desy.de/stats/day.html for resource availability.
- Long-Term:
  - Notify the 2.5k-core user to review job stability and runtime.
  - Monitor for EL9-related bugs (e.g., `materialize_max_idle` issues).
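
To make the held-job investigation less manual, the output of a single `condor_q` call can be grouped by hold reason code. The sketch below is an assumption-laden illustration, not a procedure from the ticket: it assumes the query `condor_q -hold -af ClusterId ProcId HoldReasonCode HoldReason` prints one space-separated line per job, and the function names are hypothetical. The codes and reasons in `SAMPLE` are invented placeholders, not real HTCondor values.

```python
import subprocess


def summarize_held(af_output):
    """Map each hold reason code to (job count, first example reason).

    Expects lines shaped like "<ClusterId> <ProcId> <Code> <Reason text...>".
    """
    summary = {}
    for line in af_output.splitlines():
        parts = line.split(maxsplit=3)
        if len(parts) < 4:
            continue  # skip blank or malformed lines
        code, reason = parts[2], parts[3]
        count, example = summary.get(code, (0, reason))
        summary[code] = (count + 1, example)
    return summary


def fetch_held_jobs():
    """Query the scheduler once (no polling loop) and return the raw text."""
    cmd = ["condor_q", "-hold", "-af",
           "ClusterId", "ProcId", "HoldReasonCode", "HoldReason"]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


# Invented placeholder data for illustration only.
SAMPLE = """\
101 0 901 placeholder reason: executable not found
101 1 901 placeholder reason: executable not found
102 0 902 placeholder reason: runtime limit exceeded
"""
```

On a submit node, `summarize_held(fetch_held_jobs())` would give one entry per reason code. Running it once by hand, rather than in a `watch` loop, avoids adding to the polling load described earlier.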
Sources:
1. Condor_q Errors and Explanations
2. Job Runtimes and Efficiency
3. EL9 Migration Status