1598949 : NAF jobs seem to stop/hang mid way through
Created: 2026-05-06T16:01:54Z - current status: new
Anonymized Summary: A user reports that a subset of their data processing jobs on the NAF cluster fail to complete, despite running successfully locally in minutes. The issue affects different jobs upon re-submission, suggesting potential problems with specific worker nodes rather than the jobs themselves. The user observes that a small number of jobs remain in the "RUN" state indefinitely (e.g., 204/3341 and 51/3341 jobs in two submissions).
Possible Causes and Solutions:
1. Node-Specific Issues:
- Some worker nodes may be experiencing hardware failures, network connectivity problems, or resource contention (e.g., memory/disk pressure).
- Next Step: Check the status of the stuck jobs using:
    condor_q <JOB_ID> -af HoldReason RemoteHost RemoteWallClockTime
This will show the node (RemoteHost) where the job is running and how long it has been active. If jobs are stuck on the same node(s), those nodes may be problematic.
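To see at a glance whether the stuck jobs cluster on a few workers, the RemoteHost values can be tallied per host. The following is a minimal sketch, assuming RemoteHost values of the form slot@hostname; the function name count_by_host is hypothetical and not part of HTCondor.

```shell
# count_by_host: read one RemoteHost value per line on stdin and print
# "<count> <hostname>", most frequent host first.
# Assumption: values look like "slot1_3@wn01.example.com".
count_by_host() {
    sed 's/^[^@]*@//' |          # drop the "slotN@" prefix, keep the hostname
        sort | uniq -c |         # count identical hostnames
        sort -rn |               # highest count first
        awk '{print $1, $2}'     # normalise the spacing added by uniq -c
}

# Possible usage on a submit node (JobStatus == 2 selects running jobs):
#   condor_q <USERNAME> -constraint 'JobStatus == 2' -af RemoteHost | count_by_host
```

If a handful of hosts account for most of the long-running jobs, those hosts are the first candidates to report or exclude.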
2. Job Requirements Exceeded:
- Jobs may be silently failing due to exceeding runtime or memory limits but not being killed by Condor (e.g., if the job hangs instead of crashing).
- Next Step: Inspect job resource usage with:
    condor_q <JOB_ID> -af RequestMemory MemoryUsage RequestRuntime RemoteWallClockTime
Compare MemoryUsage to RequestMemory and RemoteWallClockTime to RequestRuntime. If jobs are nearing limits, adjust requirements (e.g., increase RequestRuntime or RequestMemory).
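The comparison above can be automated for many jobs at once. This is a sketch only: it assumes RequestMemory and MemoryUsage are reported in the same unit and that RequestRuntime and RemoteWallClockTime are both in seconds. The function name flag_near_limits and the default 90% threshold are illustrative choices, not NAF settings.

```shell
# flag_near_limits [FRACTION]: read lines of the form
#   <job_id> <RequestMemory> <MemoryUsage> <RequestRuntime> <RemoteWallClockTime>
# and print "<job_id> memory" or "<job_id> runtime" when usage has reached
# FRACTION (default 0.9) of the request. Non-numeric values such as
# "undefined" are skipped rather than treated as zero.
flag_near_limits() {
    awk -v t="${1:-0.9}" '
        $2 ~ /^[0-9.]+$/ && $3 ~ /^[0-9.]+$/ && $2 > 0 && $3 >= t * $2 { print $1, "memory" }
        $4 ~ /^[0-9.]+$/ && $5 ~ /^[0-9.]+$/ && $4 > 0 && $5 >= t * $4 { print $1, "runtime" }
    '
}

# Possible usage (the :j modifier prints the job ID as the first field):
#   condor_q <USERNAME> -af:j RequestMemory MemoryUsage RequestRuntime RemoteWallClockTime \
#       | flag_near_limits 0.9
```

Jobs that never appear in this output are probably not limited by their requests, which points back to node or filesystem problems.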
3. Fast Lane Overload:
- If jobs are very short (<5 minutes), the "fast lane" mechanism may cause inefficiencies or delays in scheduling (see Job run times and job bugs).
- Next Step: Ensure jobs run for at least 5 minutes. If unavoidable, batch smaller jobs together to reduce overhead.
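Batching can be as simple as a wrapper script that processes several inputs inside one job. A minimal sketch, where run_batch and process_one are hypothetical names standing in for the user's own wrapper and per-file command:

```shell
# run_batch CMD INPUT...: run CMD once per INPUT inside a single job.
# Stops at the first failure so the job exits non-zero and the failure
# is visible in the job's exit status.
run_batch() {
    cmd=$1
    shift
    for input in "$@"; do
        "$cmd" "$input" || {
            echo "run_batch: $cmd failed on $input" >&2
            return 1
        }
    done
}

# Possible usage inside the job script (process_one is a placeholder):
#   run_batch ./process_one file_001.root file_002.root file_003.root
```

The number of inputs per job would be chosen so that the total run time comfortably exceeds the 5 minute mark mentioned above.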
4. File System Issues:
- Jobs may hang while accessing shared filesystems (e.g., /nfs or /pnfs). Network latency or node-specific filesystem problems could cause delays.
- Next Step: Test staging input/output files to local scratch space on the worker node (see Staging Files).
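One way to test staging is to copy the input to the job's scratch directory, run the command on the local copy, and copy the result back only at the end. The sketch below assumes plain cp works for the paths involved and that the command accepts an input path and an output path as its last two arguments; stage_and_run is a hypothetical helper, not a NAF tool.

```shell
# stage_and_run INPUT OUTPUT CMD [ARGS...]
# Copies INPUT to local scratch, runs "CMD ARGS... <local_in> <local_out>",
# then copies the local output to OUTPUT. The scratch copy is always removed.
# HTCondor sets _CONDOR_SCRATCH_DIR inside a job; outside a job the
# function falls back to TMPDIR or /tmp.
stage_and_run() {
    input=$1
    output=$2
    shift 2
    scratch=$(mktemp -d "${_CONDOR_SCRATCH_DIR:-${TMPDIR:-/tmp}}/stage.XXXXXX") || return 1
    local_in="$scratch/$(basename "$input")"
    local_out="$scratch/out.$(basename "$output")"
    cp "$input" "$local_in" &&
        "$@" "$local_in" "$local_out" &&
        cp "$local_out" "$output"
    status=$?
    rm -rf "$scratch"
    return "$status"
}

# Possible usage (my_analysis is a placeholder for the real program):
#   stage_and_run /nfs/.../input.root /nfs/.../output.root ./my_analysis
```

If jobs stop hanging once they only touch the shared filesystem at the start and end, the filesystem access pattern is the likely culprit.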
5. Condor Hold State:
- Jobs might be stuck in a "held" state due to transient errors but not visible in the queue.
- Next Step: Check for held jobs with:
    condor_q -held <USERNAME>
Recommended Actions:
1. Identify the stuck jobs and their assigned nodes using condor_q.
2. Check resource usage and hold reasons for those jobs.
3. If specific nodes are problematic, exclude them from future submissions using job requirements (e.g., Requirements = (Machine != "bad-node.example.com")).
4. For short jobs, consider batching or increasing runtime to avoid fast-lane inefficiencies.
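When more than one node needs to be excluded, the requirements expression from step 3 can be generated rather than typed by hand. A small sketch; build_exclude_expr is a hypothetical helper and the hostnames are placeholders:

```shell
# build_exclude_expr HOST...: print a ClassAd expression that is true only
# on machines whose Machine attribute matches none of the given hosts.
build_exclude_expr() {
    result=""
    for host in "$@"; do
        clause="(Machine != \"$host\")"
        if [ -z "$result" ]; then
            result=$clause
        else
            result="$result && $clause"
        fi
    done
    printf '%s\n' "$result"
}

# Possible usage: paste the output into the submit file, e.g.
#   Requirements = (Machine != "bad-node1.example.com") && (Machine != "bad-node2.example.com")
```

Excluding nodes is a workaround; reporting the affected hostnames to the NAF admins lets the underlying problem be fixed.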
Sources:
- Job run times and job bugs
- Job Requirements (and failures)
- Best Practices