1591262 : Machines on condor with more CPUs

Created: 2026-04-11T07:30:52Z - current status: new

Anonymized Summary: A user is submitting a computationally intensive job (a long and heavy fit) via HTCondor on the NAF infrastructure. The current submission script requests a large amount of memory (RequestMemory = 99999) and runtime (+RequestRuntime = 50000000), but only uses the default of 1 CPU core. The user inquires whether machines with more than 1 CPU are available to expedite the job.


Solution & Recommendations:

  1. Request Additional CPUs: The user can explicitly request multiple CPU cores by adding the following to the submit file:

     RequestCpus = <N>   # replace <N> with the desired number of cores (e.g., 2, 4, 8)

     Example for 4 cores:

     RequestCpus = 4

     Note: Requesting more cores increases the job's "cost" against the user's fair-share priority, potentially leading to longer queue times (see Job Requirements).

  2. Adjust Memory per Job (If Needed): The current RequestMemory = 99999 is extremely high; HTCondor interprets this value in MB, i.e. roughly 100 GB. Note that RequestMemory specifies the total memory for the job, not memory per core, so when moving to a multi-core request the value should reflect the job's overall footprint (check the NAF documentation for how memory is accounted on multi-core slots). If the total memory the fit needs is fixed, reduce the request accordingly to avoid over-allocation.

  3. Benchmark-Based Node Selection (Optional): If the job benefits from high-performance nodes, the user can prioritize machines with higher benchmark scores (e.g., KFlops or Mips) using the Rank expression. Example:

     Rank = KFlops

     Warning: Steering jobs toward a subset of machines may lengthen queue times (see Selecting Nodes).

  4. Optimize Runtime: The requested runtime (50000000 seconds ≈ 578 days) is unrealistic. Condor jobs typically have hard limits (e.g., 72 hours for non-lite jobs). The user should:
     • Estimate the actual runtime and request a realistic value (e.g., +RequestRuntime = 86400 for 24 hours).
     • Split the job into smaller chunks if possible (e.g., using queue <N> with different input files).
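
A minimal sketch of such a split, assuming the fit can run independently on each input file (the run_fit.sh wrapper and the inputs/*.dat layout are hypothetical placeholders, not taken from the user's setup):

```plaintext
# one job per matching input file (hypothetical layout)
executable = run_fit.sh
arguments  = $(inputfile)
queue inputfile matching files inputs/*.dat
```

HTCondor's "queue <var> matching files <pattern>" form submits one job per matching file, with the filename available as $(inputfile) in the submit description.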

Revised Submission Script Example:

executable          = xfitter-HERAPDFMTop.sub
transfer_executable = True
universe            = vanilla
output              = xfitter.out
error               = xfitter.err
log                 = xfitter.log
should_transfer_files   = IF_NEEDED
when_to_transfer_output = ON_EXIT
environment = "CLUSTER=$(Cluster) PROCESS_ID=$(Process) Ireplica=$ENV(pdfid)"
RequestMemory       = 25000  # 25 GB (adjust based on needs)
RequestCpus         = 4      # 4 cores
+RequestRuntime     = 86400  # 24 hours (adjust to actual runtime)
queue 1

Key Considerations:

  • Priority Impact: Requesting more resources (CPUs/memory) will reduce the user’s priority, increasing queue times for future jobs.
  • Job Splitting: If the fit can be parallelized (e.g., by parameter space), consider submitting multiple smaller jobs (see Foreach-Style Submission).
  • Testing: Test with smaller resource requests first to gauge performance.
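
The testing suggestion above can be sketched as a scaled-down submit fragment (all values are illustrative assumptions, to be replaced with numbers measured from the actual fit):

```plaintext
# scaled-down test run to gauge memory use and runtime (illustrative values)
RequestCpus     = 2
RequestMemory   = 4000    # MB
+RequestRuntime = 3600    # 1 hour
queue 1
```

Once the test job finishes, the usage reported in the job log can guide realistic RequestMemory and +RequestRuntime values for the full fit.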

Sources:

  1. Job Requirements - NAF Documentation
  2. Selecting Nodes by Benchmark Power - NAF Documentation
  3. Sophisticated Job Submission - NAF Documentation