1584760 : Re: Condor jobs not running

Created: 2026-03-16T13:18:24Z - current status: new

Summary of the Issue

A user reports that multiple HTCondor jobs in the BIRD_cms.lite queue (and some in BIRD_cms.bide) have remained in the idle state for an extended period (since roughly 12:30–13:30 on 12 March 2026). The affected job IDs are:

  • 1997843
  • 2001915–2001919 (new batch)
  • 1961226, 1961471–1961506 (earlier batch)

The user is running jobs on naf-cms11 and emphasizes urgency due to an upcoming CMS review deadline.


Possible Causes & Solutions

1. Check Job Status and Hold Reasons

  • Action: Verify whether the jobs are idle (waiting for resources) or held (stopped due to an error):

```bash
condor_q -hold [USERNAME]                                             # Check held jobs
condor_q [JOB_ID] -af HoldReason RequestMemory MemoryUsage JobStatus  # Detailed status
```

  • Expected Outcome:
    • If jobs are held, the HoldReason will indicate the issue (e.g., memory limits, runtime limits, or path errors).
    • If jobs are idle, proceed to check quotas/priorities or scheduler issues (see the match-analysis sketch after this list).
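
For idle (rather than held) jobs, condor_q can also report why a job has not matched any machine. A minimal diagnostic sketch using the standard analysis mode ([JOB_ID] is a placeholder, as above):

```bash
# Ask the scheduler why an idle job is not matching: the report shows how
# many machines were rejected by each clause of the job's Requirements.
condor_q -better-analyze [JOB_ID]
```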

2. Quota/Priority Constraints

  • Context: The BIRD_cms.lite queue has a quota-based fairshare system (see Quotas and Priorities). If the user’s weighted usage is high, new jobs may be deprioritized.
  • Action:
    • Check the user’s priority and usage:

```bash
condor_userprio.desy   # For batch jobs
condor_userprio.gpu    # For GPU jobs (if applicable)
```
  • Solution:
    • If the user’s effective priority is poor (a high numerical value in the output), jobs may wait longer. Wait for recent usage to decay (7-day rolling window).
    • If the queue is over-subscribed, consider:
      • Splitting jobs into smaller batches.
      • Using BIRD_cms.bide (longer runtime) if jobs exceed the 3-hour lite limit (see the submit-file sketch after this list).
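
Since queue placement follows the requested runtime, moving jobs to BIRD_cms.bide amounts to raising +RequestRuntime above the 3-hour lite limit. A minimal submit-file sketch, assuming that routing rule holds (run_analysis.sh is a hypothetical placeholder):

```bash
# Hypothetical submit file: request 6 h of runtime so the job exceeds the
# 3 h lite limit and should be routed to BIRD_cms.bide (routing behavior
# is a site assumption; see "Quotas and Priorities" in Sources).
executable      = run_analysis.sh   # placeholder script name
+RequestRuntime = 21600             # 6 h in seconds
RequestMemory   = 2048              # 2 GB
queue
```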

3. Scheduler Overload

  • Context: The HTCondor scheduler (bird-htc-sched21.desy.de) may be overloaded due to:
    • Faulty job submissions (e.g., invalid paths, full log directories).
    • High-frequency polling (e.g., watch condor_q or automated scripts).
  • Action:
    • Reduce polling frequency (avoid watch or rapid condor_q calls); see the polling sketch after this list.
    • Check for scheduler errors:

```bash
condor_q -global   # See all jobs across schedulers
```
  • Solution:
    • If the scheduler is unresponsive, wait for NAF admins to resolve the overload (running jobs are unaffected).
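
As a scheduler-friendly alternative to watch condor_q (which re-queries every 2 seconds by default), a minimal shell sketch; the 5-minute interval is an arbitrary choice, not a documented NAF requirement:

```bash
# Poll the queue at a gentle interval instead of hammering the scheduler.
while true; do
    condor_q "$USER"
    sleep 300   # 5 minutes between queries; adjust as needed
done
```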

4. Resource Availability

  • Context: The EL9 migration (completed July 2024) may cause temporary resource constraints (see Migration to EL9).
  • Action:
    • Check available cores/OS:

```bash
# View NAF resource stats (EL8 vs. EL9)
firefox https://bird.desy.de/stats/day.html
```
  • Solution:
    • If EL9 resources are limited, submit jobs to EL8-compatible workgroup servers (WGS), e.g., naf-cms11 with EL8 Singularity images for CMS (see the OS-pinning sketch after this list).
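
Jobs can also be pinned to a specific OS generation via the standard OpSysAndVer machine attribute. A minimal sketch; the value strings advertised at NAF (e.g., "AlmaLinux9") are an assumption, so list them first:

```bash
# List the OS versions actually advertised by the pool.
condor_status -af OpSysAndVer | sort -u
```

Then, in the submit file:

```bash
# Pin the job to EL9 nodes ("AlmaLinux9" is an assumed value string;
# use one reported by condor_status above).
requirements = (OpSysAndVer == "AlmaLinux9")
```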

5. Job Requirements

  • Context: Jobs may be idle if they exceed default limits (e.g., 3h runtime, 1.5GB memory for lite jobs).
  • Action:
    • Modify job requirements in the submit file (a complete minimal example follows this list):

```bash
# Example: request 4 h runtime and 2 GB memory
+RequestRuntime = 14400   # 4 h in seconds
RequestMemory   = 2048    # 2 GB
```

    • Release held jobs after editing:

```bash
condor_qedit [JOB_ID] RequestRuntime 14400
condor_release [JOB_ID]
```
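
For reference, a complete minimal submit file with these raised limits; all file names (run_analysis.sh, job.log, etc.) are hypothetical placeholders:

```bash
# Hypothetical minimal submit file with raised runtime/memory limits.
executable      = run_analysis.sh    # placeholder
output          = job_$(Process).out
error           = job_$(Process).err
log             = job.log
+RequestRuntime = 14400              # 4 h in seconds
RequestMemory   = 2048               # 2 GB
queue 1
```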

6. Node-Specific Issues

  • Context: Jobs may fail on specific worker nodes (e.g., bird812.desy.de).
  • Action:
    • Identify problematic nodes (a tallying sketch follows this list):

```bash
condor_history -constraint 'JobStatus != 4' -af LastRemoteHost
```
  • Solution:
    • Exclude problematic nodes in the submit file:

```bash
Requirements = (Machine =!= "bird812.desy.de")
```
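
To check whether a single node dominates the failures, the host names can be tallied. A minimal sketch, assuming the jobs are still within condor_history retention (LastRemoteHost has the form slot@hostname, hence the cut):

```bash
# Count the last execution host of recent non-completed jobs; a node that
# appears far more often than the rest is a candidate for exclusion.
condor_history "$USER" -constraint 'JobStatus != 4' -af LastRemoteHost \
  | cut -d@ -f2 | sort | uniq -c | sort -rn | head
```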

Recommended Next Steps

  1. Check job status (condor_q -hold and condor_q [JOB_ID]).
  2. Verify quotas/priorities (condor_userprio.desy).
  3. Reduce scheduler load (avoid rapid polling).
  4. Adjust job requirements (runtime/memory) if needed.
  5. Monitor resource availability (EL9/EL8 stats).
  6. Contact NAF admins if issues persist (provide job IDs and error logs).

Sources

  1. Job Requirements (and failures)
  2. Quotas and Priorities
  3. Migration to EL9
  4. Condor_q Errors
  5. Lagging Jobs