1600829 : Condor jobs almost all failing with TimeoutError

Created: 2026-05-13T13:23:17Z - current status: new


Anonymized Summary of the Issue

A user reports widespread job failures on the HTCondor system (part of the NAF infrastructure) since the previous day. Key observations:

  1. Job Failures: Most jobs fail with TimeoutError in the error logs, despite the job content being unchanged from two weeks prior when they ran successfully.
  2. Dask Management Issues: The Dask process managing the jobs is being killed spontaneously before a significant fraction of jobs complete, which is a new behavior.
  3. Community Impact: Other users in the same experiment (CMS) are experiencing similar problems.

The error logs show repeated TimeoutError exceptions during worker initialization, leading to premature termination of Dask workers.
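
To put a number on how widespread the failures are, the error logs can be scanned for the exception name. The following is a minimal sketch, not the user's tooling: the function names, the `*.err` file pattern, and the sample log contents are all assumptions made for illustration.

```python
from pathlib import Path


def count_timeout_logs(log_texts, needle="TimeoutError"):
    """Return (logs containing needle, total logs) for a {name: text} mapping."""
    hits = sum(1 for text in log_texts.values() if needle in text)
    return hits, len(log_texts)


def read_logs(log_dir, pattern="*.err"):
    """Read every file matching pattern in log_dir into a {name: text} mapping.

    The "*.err" pattern is an assumption; adjust it to the submit file's
    actual error-log naming.
    """
    return {
        path.name: path.read_text(errors="replace")
        for path in sorted(Path(log_dir).glob(pattern))
    }


# In-memory stand-ins for two error logs (contents are made up).
sample = {
    "job_0.err": "TimeoutError",
    "job_1.err": "worker started",
}
hits, total = count_timeout_logs(sample)
print(f"{hits} of {total} logs contain TimeoutError")
```

On a real log directory, `count_timeout_logs(read_logs("path/to/logs"))` would give the same pair of counts, which is a convenient figure to include when escalating.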


Possible Causes & Suggested Solutions

  1. System-Level Issues
    The errors suggest network or scheduler overload, possibly due to:
    • A high volume of faulty job submissions (e.g., misconfigured paths, full filesystems) causing scheduler congestion (see condor_q errors and explanations).
    • Network latency or connectivity problems between worker nodes and the scheduler.
    Next Steps:
    • Check the scheduler status (condor_q -global) for signs of overload.
    • Verify whether other users are submitting jobs with misconfigured paths (e.g., logging directories, executables).
    • Contact NAF-CMS support (naf-cms-support@desy.de) to investigate potential system-wide issues.

  2. Dask-Specific Issues
    The TimeoutError during worker initialization suggests:
    • Resource starvation (e.g., worker nodes unable to allocate memory/CPU).
    • Network timeouts between the Dask scheduler and workers.
    Next Steps:
    • Reduce the number of concurrent workers to avoid overloading nodes.
    • Increase timeout settings in the Dask configuration (e.g., distributed.comm.timeouts.connect).
    • Check for memory leaks or excessive resource usage in the jobs (see Job run times and job bugs).

  3. Job-Specific Checks
    • Ensure jobs are stable and tested before large-scale submission (see Job run times and job bugs).
    • Verify that output paths are valid and do not exceed filesystem limits (e.g., filename length, quota).
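
The Dask timeout setting mentioned above can be raised either in-process or through environment variables. Dask reads variables prefixed with `DASK_`, with double underscores standing in for the dots of the config key. The sketch below only builds the variable names. The `60s` value and the helper name `dask_env_var` are illustrative assumptions, and whether the submit-side environment reaches the HTCondor worker jobs depends on how the workers are launched.

```python
def dask_env_var(key):
    """Map a dotted Dask config key to its DASK_* environment variable name."""
    return "DASK_" + key.replace(".", "__").upper()


# Illustrative value only; the ticket does not say which timeout is appropriate.
overrides = {"distributed.comm.timeouts.connect": "60s"}

env = {dask_env_var(key): value for key, value in overrides.items()}
for name, value in env.items():
    print(f"export {name}={value}")

# In-process equivalent, if dask is importable in the submitting process:
#   import dask
#   dask.config.set(overrides)
```

Raising the timeout only hides the symptom if the scheduler itself is overloaded, so it is best combined with the reduced worker count suggested above.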

  1. Check Job Status:
    • Run condor_q -hold to identify held jobs and their failure reasons.
    • Run condor_history -constraint 'JobStatus == 3' to review removed jobs.

  2. Test with a Small Batch:
    • Submit a single job to isolate whether the issue is systemic or job-specific.

  3. Contact Support:
    If the problem persists, escalate to NAF-CMS support (naf-cms-support@desy.de) with:
    • Job IDs of failed submissions.
    • Excerpts from error logs (already provided).
    • Confirmation that other users are affected.
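
The two job status checks above can be collected into one script, so that their output is ready to attach to a support request. This is a rough sketch under assumptions: the helper `run` and the constant names are hypothetical, and the script simply skips execution when the HTCondor command-line tools are not on the PATH.

```python
import shlex
import shutil
import subprocess

# The two commands from the checklist, as argument lists (no shell quoting needed).
HELD_JOBS_CMD = ["condor_q", "-hold"]
REMOVED_JOBS_CMD = ["condor_history", "-constraint", "JobStatus == 3"]


def run(cmd, timeout=120):
    """Run cmd and return (returncode, stdout, stderr); None if the tool is missing."""
    if shutil.which(cmd[0]) is None:
        return None
    result = subprocess.run(
        cmd, capture_output=True, text=True, timeout=timeout, check=False
    )
    return result.returncode, result.stdout, result.stderr


if __name__ == "__main__":
    for cmd in (HELD_JOBS_CMD, REMOVED_JOBS_CMD):
        print("$", shlex.join(cmd))
        outcome = run(cmd)
        if outcome is None:
            print(f"  {cmd[0]} not found on PATH; run this on a submit node")
        else:
            print(outcome[1] or outcome[2])
```

Passing the commands as lists avoids shell-quoting mistakes in the constraint expression. On a busy pool `condor_history` can take a while, hence the explicit timeout.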

Sources Used

  1. condor_q errors and explanations
  2. Job run times and job bugs
  3. Getting support: Experiment Support