1581535 : HTCondor waiting time

Created: 2026-03-04T10:28:13Z - current status: new



Summary

A user submitted a job to HTCondor with specific memory requirements. A similar job two days earlier started within 30 minutes, but the current job remained idle overnight even though the queue was initially quiet (fewer than 10 idle jobs). By morning the queue had grown significantly (1000+ idle jobs, hundreds running). The user asks:

  1. Could changes in HTCondor’s available resources (e.g., reduced computing power) explain the delay?
  2. Could a single large job from another user be blocking the queue?
  3. How can one predict whether the job will start soon (within 10–60 minutes) or remain stuck, given that neither the absolute number of idle/running jobs nor their ratio reliably indicates wait times?


Solution

1. Why Jobs Get Delayed

  • Resource Availability Fluctuates: The National Analysis Facility (NAF) dynamically allocates resources based on demand. If other users submit jobs with high memory/core requirements, fewer slots may be available for your job, even if the queue appears "empty" initially. This is normal and can vary hourly/daily. Example: A single job requesting 100+ cores or terabytes of memory could occupy resources for hours, delaying smaller jobs.

  • Job Requirements Matter: HTCondor prioritizes jobs based on matchmaking between job requirements (e.g., memory, runtime) and available resources. If your job requests more memory/cores than the default "lite" scheme (1 core, 1.5 GB, <3h runtime), it may wait longer for suitable slots. Key: Jobs with accurate requirements (close to actual needs) are scheduled more efficiently. Overestimating resources can lead to longer waits.
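Explicit, accurate requirements are set in the submit description file. A minimal sketch, assuming a hypothetical payload script analysis.sh; the +RequestRuntime attribute is a site-specific convention (used at the NAF for requested runtime in seconds), not a standard HTCondor attribute:

```
# analysis.sub -- hypothetical submit file with explicit requirements
executable      = analysis.sh
output          = analysis.$(Cluster).$(Process).out
error           = analysis.$(Cluster).$(Process).err
log             = analysis.$(Cluster).log

# Request only what the job actually needs; overestimating
# shrinks the set of matching slots and lengthens the wait.
request_cpus    = 1
request_memory  = 2 GB

# Site-specific (NAF): requested runtime in seconds (here: 2 hours)
+RequestRuntime = 7200

queue 1
```

Submit with condor_submit analysis.sub; jobs whose requests stay close to actual usage match more slots and tend to start sooner.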

  • Queue Dynamics: The number of idle/running jobs is not a direct indicator of wait time. A queue with 1000 idle jobs might process quickly if most are small, while a queue with 10 idle jobs could stall if one job monopolizes resources.

2. How to Check Job Status

  • Analyze Why Your Job Is Idle: Use condor_q -analyze [JOB_ID] to see why your job isn’t running. This shows:

    • Whether resources matching your requirements are available.
    • If other jobs are blocking yours (e.g., higher priority or preempting resources).
  • Check Resource Usage: Run condor_status to see available slots and their attributes (e.g., memory, cores). Compare this to your job’s requirements to gauge competition.

  • Monitor Hold Reasons: If your job is held, use condor_q -hold [JOB_ID] to identify issues (e.g., memory limits, output file errors). Adjust requirements with condor_qedit if needed.
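The checks above can be run as follows (1234.0 stands in for the actual job ID, and the 4096 MB memory request is purely illustrative):

```bash
# Why is the job still idle? Summarizes how many slots match or reject its requirements.
condor_q -analyze 1234.0

# What does the pool look like? Lists slots with attributes such as memory and cores.
condor_status

# If the job is held, show the hold reason.
condor_q -hold 1234.0

# Adjust the memory request of the held job, then release it back into the queue.
condor_qedit 1234.0 RequestMemory=4096
condor_release 1234.0
```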

3. Predicting Wait Times

  • Short-Term Prediction: Use condor_q -run -nobatch to see which machines jobs are running on and for how long; combine this with condor_status to gauge how many slots are free. If many running jobs are close to their runtime limit, slots may free up soon and your job may start shortly. Tip: Jobs with shorter requested runtimes (e.g., <1h) fit into more scheduling gaps and tend to see quicker turnaround.

  • Long-Term Strategy:

    • Submit jobs with precise requirements to avoid overbooking resources.
    • Split large jobs into smaller batches to reduce wait times.
    • Use condor_q -better-analyze to debug scheduling issues.
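Splitting a large job into smaller ones is done in the submit file via the queue statement. A sketch, assuming a hypothetical script process_chunk.sh that takes a chunk index as its argument:

```
# One cluster of 20 short jobs instead of one long job;
# each job receives its chunk number via $(Process) (0..19).
executable      = process_chunk.sh
arguments       = $(Process)
output          = chunk.$(Process).out
error           = chunk.$(Process).err
log             = chunks.log

request_cpus    = 1
request_memory  = 1.5 GB

queue 20
```

Many short jobs can be slotted into the pool far more easily than one monolithic job, which typically reduces overall waiting time.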

4. Next Steps

  • Check your job’s status:

    ```bash
    condor_q -analyze [JOB_ID]
    condor_q -run -nobatch
    ```

  • Adjust requirements if needed:

    ```bash
    condor_qedit [JOB_ID] RequestMemory=4096   # Example: 4 GB
    condor_release [JOB_ID]                    # After editing
    ```

Sources

  1. Submitting Jobs with Specific Requirements
  2. Checking & Managing Jobs
  3. Job Requirements and Failures
  4. How Much Memory Does My Job Need?