1600829 : Condor jobs almost all failing with TimeoutError
Created: 2026-05-13T13:23:17Z - current status: new
Anonymized Summary of the Issue
A user reports widespread job failures on the HTCondor system (part of the NAF infrastructure) since the previous day. Key observations:
1. Job Failures: Most jobs fail with TimeoutError in the error logs, even though the jobs are unchanged from two weeks earlier, when they ran successfully.
2. Dask Management Issues: The Dask process managing the jobs is being killed spontaneously before a significant fraction of jobs complete, which is a new behavior.
3. Community Impact: Other users in the same experiment (CMS) are experiencing similar problems.
The error logs show repeated TimeoutError exceptions during worker initialization, leading to premature termination of Dask workers.
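To gauge how widespread the failures are, the error logs can be scanned for the exception name. The following is a minimal sketch, not part of the original report: the directory name `logs` and the `*.err` file pattern are assumptions and should be replaced with whatever the submit file actually uses.

```python
from collections import Counter
from pathlib import Path


def count_timeouts(log_dir, pattern="*.err", needle="TimeoutError"):
    """Return a Counter mapping log file name -> number of lines containing `needle`."""
    counts = Counter()
    for path in sorted(Path(log_dir).glob(pattern)):
        # errors="replace" keeps the scan going if a log contains stray bytes.
        with path.open(errors="replace") as handle:
            hits = sum(1 for line in handle if needle in line)
        if hits:
            counts[path.name] = hits
    return counts


if __name__ == "__main__":
    # "logs" is a placeholder directory; point it at the job error logs.
    for name, hits in count_timeouts("logs").most_common():
        print(f"{name}: {hits}")
```

Comparing the number of affected files against the total number of submitted jobs gives a rough failure fraction to include in a support request.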
Possible Causes & Suggested Solutions
- System-Level Issues
  - The errors suggest network or scheduler overload, possibly due to:
    - A high volume of faulty job submissions (e.g., misconfigured paths, full filesystems) causing scheduler congestion (see condor_q errors and explanations).
    - Network latency or connectivity problems between worker nodes and the scheduler.
  - Next Steps:
    - Check the scheduler status (`condor_q -global`) for signs of overload.
    - Verify if other users are submitting jobs with misconfigured paths (e.g., logging directories, executables).
    - Contact NAF-CMS support (naf-cms-support@desy.de) to investigate potential system-wide issues.
- Dask-Specific Issues
  - The `TimeoutError` during worker initialization suggests:
    - Resource starvation (e.g., worker nodes unable to allocate memory/CPU).
    - Network timeouts between the Dask scheduler and workers.
  - Next Steps:
    - Reduce the number of concurrent workers to avoid overloading nodes.
    - Increase timeout settings in Dask configuration (e.g., `distributed.comm.timeouts.connect`).
    - Check for memory leaks or excessive resource usage in the jobs (see Job run times and job bugs).
- Job-Specific Checks
  - Ensure jobs are stable and tested before large-scale submission (see Job run times and job bugs).
  - Verify that output paths are valid and do not exceed filesystem limits (e.g., filename length, quota issues).
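The suggestion to raise Dask's timeout settings can be sketched as follows. The `60s` values are illustrative assumptions, not recommendations from the ticket, and `as_dask_env` is a hypothetical helper name. Because Dask workers on batch nodes run as separate processes, the sketch also derives the equivalent `DASK_*` environment variables, which Dask reads with `__` separating nesting levels.

```python
# Assumed values for illustration; tune them to the cluster.
TIMEOUT_OVERRIDES = {
    "distributed.comm.timeouts.connect": "60s",
    "distributed.comm.timeouts.tcp": "60s",
}


def as_dask_env(overrides):
    """Translate dotted Dask config keys into DASK_* environment variables."""
    return {
        "DASK_" + key.upper().replace(".", "__"): str(value)
        for key, value in overrides.items()
    }


if __name__ == "__main__":
    import dask  # imported here so the helper above works without Dask installed

    # Apply in the submitting process, before the cluster and client are created.
    dask.config.set(TIMEOUT_OVERRIDES)
    print(dask.config.get("distributed.comm.timeouts.connect"))

    # Variables to export in the worker job environment.
    for name, value in as_dask_env(TIMEOUT_OVERRIDES).items():
        print(f"{name}={value}")
```

Longer timeouts only mask the symptom if the scheduler or network is genuinely overloaded, so this is best combined with a reduced worker count rather than used on its own.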
Recommended Immediate Actions
- Check Job Status:
  - Run `condor_q -hold` to identify held jobs and their failure reasons.
  - Run `condor_history -constraint 'JobStatus == 3'` to review removed jobs.
- Test with a Small Batch:
  - Submit a single job to isolate whether the issue is systemic or job-specific.
- Contact Support:
  - If the problem persists, escalate to NAF-CMS support (naf-cms-support@desy.de) with:
    - Job IDs of failed submissions.
    - Excerpts from error logs (already provided).
    - Confirmation that other users are affected.
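The two job-status checks can be collected into one script so that their output is easy to attach to a support request. This is a minimal sketch assuming the HTCondor command-line tools are on `PATH`. The `-limit 50` argument is an added assumption to keep the history output short, and `run_checks` is a hypothetical helper name.

```python
import shutil
import subprocess

# Commands from the checklist; "-limit 50" is an assumed cap on history output.
CHECKS = {
    "held jobs": ["condor_q", "-hold"],
    "removed jobs": ["condor_history", "-constraint", "JobStatus == 3", "-limit", "50"],
}


def run_checks(checks=CHECKS):
    """Run each command and return {label: output}; skip tools not on PATH."""
    results = {}
    for label, argv in checks.items():
        if shutil.which(argv[0]) is None:
            results[label] = f"{argv[0]} not found on PATH"
            continue
        # check=False: a non-zero exit is reported via stderr instead of raising.
        proc = subprocess.run(argv, capture_output=True, text=True, check=False)
        results[label] = proc.stdout or proc.stderr
    return results


if __name__ == "__main__":
    for label, output in run_checks().items():
        print(f"== {label} ==\n{output}")
```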