1600372 : your jobs in NAF

Created: 2026-05-12T11:02:40Z - current status: new


Anonymized Summary

A user's HTCondor job submissions are overloading the scheduler because of a log file size problem. Specifically:

  - The jobs attempt to write logs to a file path (/afs/desy.de/user/[USERNAME]/.../condor_output/[JOB_ID].log).
  - The log file has grown too large (errno 27: File too large), which prevents further writes.
  - This blocks the scheduler for all users of the CMS group, because the scheduler is stuck trying to handle the faulty jobs.


Solution

  Immediate Action:

  1. Stop submitting new jobs until the issue is resolved.
  2. Check the log file path for typos or misconfigurations (e.g., incorrect directory permissions, full filesystem).
  3. Rotate or truncate the log file if it exceeds filesystem limits (e.g., AFS has a 2 GB file size limit).
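
The size check and rotation described above could be sketched in Python roughly as follows. This is a minimal sketch under assumptions: the 2 GB figure is the limit cited above, the 90% warning fraction is arbitrary, and the function names are hypothetical. It should only be run once no job is still writing to the file.

```python
import os

# Assumption: the 2 GB AFS file size limit mentioned in the ticket.
LIMIT_BYTES = 2 * 1024**3
# Assumption: treat a log as "too large" at 90% of the limit.
WARN_FRACTION = 0.9


def log_needs_rotation(path, limit=LIMIT_BYTES, fraction=WARN_FRACTION):
    """Return True if the file at `path` has reached `fraction` of `limit`."""
    return os.path.getsize(path) >= limit * fraction


def rotate_log(path):
    """Move `path` to `path + '.old'` and leave an empty file in its place.

    An existing '.old' file is overwritten.
    """
    backup = path + ".old"
    os.replace(path, backup)   # rename; replaces any existing backup
    open(path, "w").close()    # recreate an empty log file
    return backup
```

A caller would check `log_needs_rotation(path)` for the affected log and call `rotate_log(path)` only when it returns True.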

  Prevent Recurrence:

  1. Use the log command in the HTCondor submit description so that each job writes its own, smaller log file (e.g., log = job_$(Cluster)_$(Process).log).
  2. Test jobs locally before large-scale submission to avoid scheduler overload.
  3. Monitor job logs for early signs of issues (e.g., memory leaks, crashes).
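
As an illustration of the per-job log naming, the following Python sketch generates a submit description in which every job gets its own log, output, and error file. The helper name, the executable, and the directory are hypothetical placeholders, not taken from the ticket; only the submit commands (executable, log, output, error, queue) and the $(Cluster) and $(Process) macros are standard HTCondor.

```python
def make_submit_description(executable, output_dir, n_jobs):
    """Build submit description text with one log/output/error file per job.

    $(Cluster) and $(Process) are expanded by HTCondor at submit time,
    so they are written out literally here.
    """
    name = "job_$(Cluster)_$(Process)"
    lines = [
        f"executable = {executable}",
        f"log    = {output_dir}/{name}.log",
        f"output = {output_dir}/{name}.out",
        f"error  = {output_dir}/{name}.err",
        f"queue {n_jobs}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    print(make_submit_description("run_analysis.sh", "condor_output", 10))
```

The printed text can be saved to a file and passed to condor_submit. Starting with a small job count is one way to follow the advice above about testing before large-scale submission.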

  Release Held Jobs:

  1. After fixing the log path, use condor_release [JOB_ID] to release the held jobs so that they return to the queue.

Sources