1600372 : your jobs in NAF
Created: 2026-05-12T11:02:40Z - current status: new
Anonymized Summary
A user's HTCondor job submissions are overloading the scheduler because a job log file has hit a file size limit. Specifically:
- The job attempts to write logs to a file path (/afs/desy.de/user/[USERNAME]/.../condor_output/[JOB_ID].log).
- The log file has grown too large (errno 27: File too large), preventing further writes.
- This is blocking the scheduler for all users of the CMS group, as the scheduler is stuck trying to handle the faulty jobs.
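To double-check what the numeric code in the error means, the errno value can be decoded with Python's standard library. This is only a minimal sketch; describe_errno is a hypothetical helper name and is not part of HTCondor.

```python
import errno
import os


def describe_errno(code):
    """Return (symbolic name, message) for a numeric errno value."""
    # errno.errorcode maps numeric codes to names such as "EFBIG".
    name = errno.errorcode.get(code, "UNKNOWN")
    # os.strerror returns the platform's message text for the code.
    return name, os.strerror(code)


name, message = describe_errno(27)
print(f"errno 27 = {name}: {message}")
```

On Linux, code 27 corresponds to EFBIG, which matches the "File too large" text reported for the log file.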
Solution
- Immediate Action:
  - Stop submitting new jobs until the issue is resolved.
  - Check the log file path for typos or misconfigurations (e.g., incorrect directory permissions, full filesystem).
  - Rotate or truncate the log file if it exceeds filesystem limits (e.g., AFS has a 2 GB file size limit).
- Prevent Recurrence:
  - Use Condor’s log directive to split logs into smaller files (e.g., log = job_$(Cluster)_$(Process).log).
  - Test jobs locally before large-scale submission to avoid scheduler overload.
  - Monitor job logs for early signs of issues (e.g., memory leaks, crashes).
- Release Held Jobs:
  - After fixing the log path, use condor_release [JOB_ID] to restart held jobs.
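
The "rotate or truncate" and "monitor job logs" steps above could be sketched roughly as follows. This is a minimal illustration, not an official NAF or HTCondor tool: find_oversized_logs and keep_tail are hypothetical helpers, the 1 GiB threshold is an assumed safety margin below the 2 GB limit mentioned above, and the *.log pattern assumes logs are named like the [JOB_ID].log path in the summary.

```python
from pathlib import Path

# Assumed threshold: stay well below the 2 GB limit mentioned in the ticket.
DEFAULT_LIMIT_BYTES = 1 * 1024**3


def find_oversized_logs(log_dir, limit_bytes=DEFAULT_LIMIT_BYTES):
    """Return (path, size) pairs for *.log files above limit_bytes, largest first."""
    oversized = []
    for path in Path(log_dir).glob("*.log"):
        size = path.stat().st_size
        if size > limit_bytes:
            oversized.append((path, size))
    return sorted(oversized, key=lambda item: item[1], reverse=True)


def keep_tail(path, keep_bytes):
    """Shrink a log in place so only its last keep_bytes bytes remain.

    Returns the resulting file size. The tail is read into memory,
    so keep_bytes should be modest (a few MiB).
    """
    path = Path(path)
    size = path.stat().st_size
    if size <= keep_bytes:
        return size  # already small enough, leave the file untouched
    with open(path, "rb") as handle:
        handle.seek(size - keep_bytes)
        tail = handle.read()
    with open(path, "wb") as handle:
        handle.write(tail)
    return len(tail)
```

For example, calling find_oversized_logs on the condor_output directory lists the files that need attention, and keep_tail(path, 10 * 1024**2) would keep only the last 10 MiB of one of them. Truncating a log that running jobs still write to may lose events, so it is probably safest to do this only after the affected jobs are held or removed.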