1593890: jobs get stuck on condor
Created: 2026-04-20T09:13:13Z - current status: new
Anonymized Summary:
A user reports that their HTCondor jobs (using the PocketCoffea framework with Parsl as the executor) frequently stall at 99% completion when processing samples from pnfs directories. The jobs only proceed after manually issuing condor_hold and condor_release commands. The issue emerged after a recent NAF update.
Possible Causes & Solutions:
- Resource Limits or Memory Leaks
  - The jobs may be hitting memory limits or experiencing temporary resource contention, causing Condor to pause them. Since the jobs resume after a hold/release, this suggests Condor’s enforcement of limits is involved.
  - Action:
    - Check job logs for `HoldReason` using: `condor_q [JOB_ID] -af HoldReason RequestMemory MemoryUsage`
    - If memory usage is near the limit (default: 1500 MB), increase the requested memory via `condor_qedit`: `condor_qedit [JOB_ID] RequestMemory [NEW_VALUE_IN_MB]`, followed by `condor_release [JOB_ID]`
- EL9 Migration-Related Issues
  - The NAF recently migrated to EL9, which introduced temporary workarounds (e.g., kernel memory flushing) that may affect job stability.
  - Action:
    - Ensure jobs are submitted to EL9-compatible workgroup servers (WGS). Use `condor_q -global` to verify job distribution.
    - If using CMS-specific workflows, confirm the correct Apptainer/Singularity image is specified (e.g., `/cvmfs/unpacked.cern.ch/registry.hub.docker.com/cmssw/cc8:amd64` for EL8 compatibility).
- Parsl/PocketCoffea Integration Issues
  - Parsl may not handle Condor’s hold/release cycles gracefully, causing jobs to hang at 99%.
  - Action:
    - Test with a simpler Condor submission script (without Parsl) to isolate the issue.
    - Check Parsl’s logs for errors during job finalization (e.g., file staging, cleanup).
- pnfs Access Delays
  - Jobs may stall while waiting for pnfs I/O operations (e.g., file closing, metadata updates).
  - Action:
    - Add a timeout or retry logic in the job script for pnfs operations.
    - Verify pnfs permissions and quotas for the user’s account.
- Condor Scheduler Overload
  - If the scheduler is busy (e.g., due to many held jobs), it may delay job updates.
  - Action:
    - Avoid frequent `condor_q` polling. Use `condor_q -global` sparingly.
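The timeout and retry suggestion for pnfs operations could look roughly like the sketch below. It is a minimal illustration, not taken from the ticket: the function name `retry_with_timeout` and its argument order are hypothetical, and it assumes GNU coreutils `timeout` is available on the worker node.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of HTCondor, Parsl, or PocketCoffea):
# run a command with a per-attempt time limit and retry it on failure.
#
# Usage: retry_with_timeout ATTEMPTS LIMIT_SECONDS PAUSE_SECONDS COMMAND [ARGS...]
# COMMAND must be an external program; timeout(1) cannot run shell functions.
retry_with_timeout() {
    local attempts=$1 limit=$2 pause=$3
    shift 3
    local n
    for (( n = 1; n <= attempts; n++ )); do
        # timeout(1) terminates the command if it runs longer than $limit seconds
        if timeout "$limit" "$@"; then
            return 0
        fi
        echo "attempt $n/$attempts failed: $*" >&2
        # Pause before the next attempt, but not after the last one
        if (( n < attempts )); then
            sleep "$pause"
        fi
    done
    return 1
}
```

In a job script this might wrap a copy from pnfs, for example `retry_with_timeout 3 120 30 cp "$INPUT_FILE" "$TMPDIR/"`, where `INPUT_FILE` is a placeholder for the sample path. Suitable limits depend on file size and storage load, so the numbers here are only illustrative.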
Recommended Next Steps:
- Inspect Held Jobs: `condor_q -held [USERNAME] -af HoldReason`
  - Look for patterns (e.g., memory limits, runtime limits).
- Test Without Parsl:
  - Submit a minimal job directly via Condor to rule out framework-specific issues.
- Monitor EL9 Resources:
  - Check core availability and OS version via BIRD Stats.
- Contact NAF Support:
  - If the issue persists, provide:
    - Job IDs of stuck jobs.
    - Output of `condor_q [JOB_ID] -l`.
    - Parsl/PocketCoffea logs (if applicable).
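When looking for memory-limit patterns across many jobs, a small filter can help spot jobs whose usage is close to their request. The following is a sketch under stated assumptions: the function name `flag_memory_pressure` and the 90% default threshold are hypothetical, and the input is expected as one job per line in the form "job ID, RequestMemory, MemoryUsage" (both in MB).

```bash
#!/usr/bin/env bash
# Hypothetical helper: flag jobs whose memory usage is close to their request.
# Reads lines of the form "CLUSTER.PROC REQUEST_MB USAGE_MB" on stdin and
# prints those at or above THRESHOLD_PERCENT (default 90) of the request.
flag_memory_pressure() {
    local threshold=${1:-90}
    awk -v pct="$threshold" '
        # Skip rows where either value is not a plain number (e.g. "undefined")
        $2 !~ /^[0-9]+$/ || $3 !~ /^[0-9]+$/ { next }
        # Compare usage against the chosen percentage of the request
        $2 > 0 && $3 * 100 >= $2 * pct {
            printf "%s uses %d of %d MB\n", $1, $3, $2
        }
    '
}

# Possible use on a submit node (the :j modifier prefixes each row with the job ID):
#   condor_q -af:j RequestMemory MemoryUsage | flag_memory_pressure 90
```

Jobs reported by this filter are candidates for a higher `RequestMemory`; jobs that stall without appearing here more likely point to one of the other causes.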
Sources:
- Job Requirements (NAF Documentation)
- EL9 Migration Notes