1593890 : jobs get stuck on condor

Created: 2026-04-20T09:13:13Z - current status: new

Anonymized Summary: A user reports that their HTCondor jobs (using the PocketCoffea framework with Parsl as the executor) frequently stall at 99% completion when processing samples from pnfs directories. The jobs only proceed after manually issuing condor_hold and condor_release commands. The issue emerged after a recent NAF update.


Possible Causes & Solutions:

  1. Resource Limits or Memory Leaks
     The jobs may be hitting memory limits or experiencing temporary resource contention, causing Condor to pause them. That the jobs resume after a hold/release suggests Condor’s enforcement of limits is involved.
     Action:
       • Query the hold reason and memory attributes of an affected job:
           condor_q [JOB_ID] -af HoldReason RequestMemory MemoryUsage
       • If memory usage is near the limit (default: 1500MB), increase the requested memory with condor_qedit, then release the job:
           condor_qedit [JOB_ID] RequestMemory [NEW_VALUE_IN_MB]
           condor_release [JOB_ID]

  2. EL9 Migration-Related Issues
     The NAF recently migrated to EL9, which introduced temporary workarounds (e.g., kernel memory flushing) that may affect job stability.
     Action:
       • Ensure jobs are submitted from EL9-compatible workgroup servers (WGS). Use condor_q -global to verify job distribution.
       • If using CMS-specific workflows, confirm that the correct Apptainer/Singularity image is specified (e.g., /cvmfs/unpacked.cern.ch/registry.hub.docker.com/cmssw/cc8:amd64 for EL8 compatibility).

  3. Parsl/PocketCoffea Integration Issues
     Parsl may not handle Condor’s hold/release cycles gracefully, causing jobs to hang at 99%.
     Action:
       • Test with a simpler Condor submission script (without Parsl) to isolate the issue.
       • Check Parsl’s logs for errors during job finalization (e.g., file staging, cleanup).

  4. pnfs Access Delays
     Jobs may stall while waiting for pnfs I/O operations (e.g., file closing, metadata updates).
     Action:
       • Add a timeout or retry logic in the job script for pnfs operations.
       • Verify pnfs permissions and quotas for the user’s account.

  5. Condor Scheduler Overload
     If the scheduler is busy (e.g., due to many held jobs), it may delay job updates.
     Action:
       • Avoid frequent condor_q polling. Use condor_q -global sparingly, since it queries every scheduler in the pool.
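
The timeout-and-retry suggestion for pnfs operations could look roughly like the wrapper below. It is a minimal sketch, assuming the job script is a bash script and that GNU coreutils timeout is available on the worker node. The function name retry_with_timeout and the RETRY_PAUSE variable are hypothetical and not part of HTCondor, Parsl, or PocketCoffea.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of HTCondor, Parsl, or PocketCoffea).
# Usage: retry_with_timeout MAX_ATTEMPTS TIMEOUT_SECONDS COMMAND [ARGS...]
# COMMAND must be an external program, because coreutils `timeout`
# cannot run shell functions or builtins.
retry_with_timeout() {
    local max_attempts=$1 timeout_s=$2
    shift 2
    local attempt=1
    while true; do
        # `timeout` kills COMMAND after timeout_s seconds and then
        # returns a non-zero status, which counts as a failed attempt.
        if timeout "$timeout_s" "$@"; then
            return 0
        fi
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "giving up after $attempt attempt(s): $*" >&2
            return 1
        fi
        # Pause grows linearly with the attempt number.
        # RETRY_PAUSE is an assumed tuning knob (seconds, default 5).
        sleep "$(( attempt * ${RETRY_PAUSE:-5} ))"
        attempt=$(( attempt + 1 ))
    done
}
```

For example, retry_with_timeout 3 60 ls "$INPUT_DIR" would attempt the listing up to three times with a 60-second limit per attempt, where INPUT_DIR stands in for the user's pnfs directory. A command that still fails after the last attempt makes the wrapper return 1, so the job script can exit with an error instead of hanging.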

Recommended Steps:

  1. Inspect Held Jobs
     List the hold reasons and look for patterns (e.g., memory limits, runtime limits):
         condor_q -held [USERNAME] -af HoldReason

  2. Test Without Parsl
     Submit a minimal job directly via Condor to rule out framework-specific issues.

  3. Monitor EL9 Resources
     Check core availability and OS version via BIRD Stats.

  4. Contact NAF Support
     If the issue persists, provide:
       • Job IDs of stuck jobs.
       • Output of condor_q [JOB_ID] -l.
       • Parsl/PocketCoffea logs (if applicable).
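
To gather the items listed above before contacting support, a small helper along the following lines could be used. It is only a sketch: the function name collect_job_info and the output file names are hypothetical. It uses only condor_q with the -l and -af options that already appear in this ticket, plus the standard JobStatus ClassAd attribute.

```shell
#!/usr/bin/env bash
# Hypothetical helper for bundling job diagnostics for a support request.
# Usage: collect_job_info OUTPUT_DIR JOB_ID [JOB_ID...]
collect_job_info() {
    if [ "$#" -lt 2 ]; then
        echo "usage: collect_job_info OUTPUT_DIR JOB_ID [JOB_ID...]" >&2
        return 2
    fi
    local outdir=$1
    shift
    mkdir -p "$outdir" || return 1
    local job
    for job in "$@"; do
        # Full ClassAd of the job (the "condor_q [JOB_ID] -l" output).
        condor_q "$job" -l > "$outdir/job_${job}.classad" 2>&1
        # Short summary of the attributes most relevant to stalled jobs.
        condor_q "$job" -af JobStatus HoldReason RequestMemory MemoryUsage \
            > "$outdir/job_${job}.summary" 2>&1
    done
    # Record the job IDs themselves, one per line.
    printf '%s\n' "$@" > "$outdir/job_ids.txt"
}
```

Calling collect_job_info support_info [JOB_ID] writes one .classad and one .summary file per job into the support_info directory, which can then be attached to the ticket together with the Parsl/PocketCoffea logs.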

Sources:
  - Job Requirements (NAF Documentation)
  - EL9 Migration Notes