1600829 : Condor jobs almost all failing with TimeoutError

Created: 2026-05-13T13:23:17Z - current status: new

Anonymized Summary of the Issue

A user reports widespread job failures on the HTCondor system (part of the NAF infrastructure) since the previous day. Key observations:
1. Job Failures: Most jobs fail with TimeoutError in the error logs, despite the job content being unchanged from two weeks prior when they ran successfully.
2. Dask Management Issues: The Dask process managing the jobs is being killed spontaneously before a significant fraction of jobs complete, which is a new behavior.
3. Community Impact: Other users in the same experiment (CMS) are experiencing similar problems.

The error logs show repeated TimeoutError exceptions during worker initialization, leading to premature termination of Dask workers.
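To quantify how widespread the failures are, the error logs can be scanned for the exception name. The following is a minimal sketch, assuming the job error logs are plain-text files collected in one directory; the `*.err` pattern and the function names are illustrative assumptions, not part of the ticket.

```python
import re
from collections import Counter
from pathlib import Path

# Matches the exception name as a whole word, e.g. "asyncio.exceptions.TimeoutError".
TIMEOUT_PATTERN = re.compile(r"\bTimeoutError\b")


def count_timeouts(log_text: str) -> int:
    """Return the number of lines in log_text that mention TimeoutError."""
    return sum(1 for line in log_text.splitlines() if TIMEOUT_PATTERN.search(line))


def summarize_logs(log_dir: str, pattern: str = "*.err") -> Counter:
    """Map each matching log file name to its TimeoutError line count.

    The "*.err" default is an assumption about how the logs are named.
    """
    counts = Counter()
    for path in sorted(Path(log_dir).glob(pattern)):
        counts[path.name] = count_timeouts(path.read_text(errors="replace"))
    return counts
```

Calling `summarize_logs("logs").most_common(10)` would list the ten files with the most TimeoutError lines, which may help when selecting excerpts for a support request.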

Possible Causes & Suggested Solutions

System-Level Issues
The errors suggest network or scheduler overload, possibly due to:
- A high volume of faulty job submissions (e.g., misconfigured paths, full filesystems) causing scheduler congestion (see condor_q errors and explanations).
- Network latency or connectivity problems between worker nodes and the scheduler.
Next Steps:
- Check the scheduler status (condor_q -global) for signs of overload.
- Verify if other users are submitting jobs with misconfigured paths (e.g., logging directories, executables).
- Contact NAF-CMS support (naf-cms-support@desy.de) to investigate potential system-wide issues.

Dask-Specific Issues
The TimeoutError during worker initialization suggests:
- Resource starvation (e.g., worker nodes unable to allocate memory/CPU).
- Network timeouts between the Dask scheduler and workers.
Next Steps:
- Reduce the number of concurrent workers to avoid overloading nodes.
- Increase timeout settings in Dask configuration (e.g., distributed.comm.timeouts.connect).
- Check for memory leaks or excessive resource usage in the jobs (see Job run times and job bugs).
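The timeout suggestion above can be sketched as follows. This is a hedged example: the `60s` values are placeholders rather than recommendations, and the helper names are hypothetical. It relies on two Dask conventions: `dask.config.set` accepts dotted keys, and environment variables prefixed with `DASK_` are read as configuration, with a double underscore marking each nesting level. The environment-variable form can be useful because the workers run in separate processes on the batch nodes and need the setting as well.

```python
# Placeholder values (assumption); tune them to the observed start-up delays.
TIMEOUT_OVERRIDES = {
    "distributed.comm.timeouts.connect": "60s",
    "distributed.comm.timeouts.tcp": "60s",
}


def to_env_var(config_key: str) -> str:
    """Translate a dotted Dask config key into its environment-variable name."""
    return "DASK_" + config_key.upper().replace(".", "__")


def worker_environment(overrides: dict) -> dict:
    """Build the environment variables that carry the overrides to workers."""
    return {to_env_var(key): value for key, value in overrides.items()}


def apply_locally(overrides: dict) -> None:
    """Apply the overrides in the current process (requires dask)."""
    import dask  # imported lazily so the helpers above work without dask

    dask.config.set(overrides)
```

Here `worker_environment(TIMEOUT_OVERRIDES)` yields names such as `DASK_DISTRIBUTED__COMM__TIMEOUTS__CONNECT`, which could be exported in the job environment, while `apply_locally` covers the process that starts the Dask client and scheduler.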

Job-Specific Checks
- Ensure jobs are stable and tested before large-scale submission (see Job run times and job bugs).
- Verify that output paths are valid and do not exceed filesystem limits (e.g., filename length, quota).

Recommended Immediate Actions

Check Job Status:
- Run condor_q -hold to identify held jobs and their failure reasons.
- Run condor_history -constraint 'JobStatus == 3' to review removed jobs.
Test with a Small Batch:
- Submit a single job to isolate whether the issue is systemic or job-specific.
Contact Support:
If the problem persists, escalate to NAF-CMS support (naf-cms-support@desy.de) with:
- Job IDs of failed submissions.
- Excerpts from error logs (already provided).
- Confirmation that other users are affected.
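The held-job check and the job ID collection for the support request can be combined in a small script. The following is a sketch under stated assumptions: it uses the `-af` (autoformat) option of condor_q to print the `ClusterId`, `ProcId`, and `HoldReason` attributes, and the function names are hypothetical. It has not been run against the NAF scheduler.

```python
import subprocess
from collections import Counter

# -af (autoformat) prints the listed ClassAd attributes, one job per line.
HOLD_QUERY = ["condor_q", "-hold", "-af", "ClusterId", "ProcId", "HoldReason"]


def parse_hold_lines(output: str) -> list[tuple[str, str]]:
    """Turn autoformat output into (job_id, hold_reason) pairs."""
    jobs = []
    for line in output.splitlines():
        # The first two fields are single tokens; the remainder is the reason.
        parts = line.split(maxsplit=2)
        if len(parts) < 3:
            continue  # skip blank or incomplete lines
        cluster, proc, reason = parts
        jobs.append((f"{cluster}.{proc}", reason.strip()))
    return jobs


def count_reasons(jobs: list[tuple[str, str]]) -> Counter:
    """Count how many held jobs share each hold reason."""
    return Counter(reason for _, reason in jobs)


def query_held_jobs() -> list[tuple[str, str]]:
    """Run condor_q on a submit node and parse its output."""
    result = subprocess.run(HOLD_QUERY, capture_output=True, text=True, check=True)
    return parse_hold_lines(result.stdout)
```

On a submit node, `count_reasons(query_held_jobs()).most_common()` would group the held jobs by reason, and the job IDs in the first element of each pair can be pasted into the support request.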
