1580173 : Available naf cores decreasing?

Created: 2026-02-27T08:03:33Z - current status: new

Summary of the Issue

A user reports observing a decline in the number of available healthy slots in the NAF’s daily statistics, with the count dropping below 4,000. They suspect this may be due to a single user consuming 3,000–4,000 cores for an extended period and request an investigation.


Anonymized & Neutralized Report

A user noticed a reduction in available healthy slots in the NAF’s daily statistics, with the count falling below 4,000. They inquired whether this could be attributed to a single user occupying 3,000–4,000 cores for an extended duration and requested an administrative review.


Possible Explanation & Next Steps

  1. Quota & Fair-Share Mechanism: The NAF employs a fair-share scheduling system with group and user quotas (e.g., BIRD_[experiment].lite/bide). If a user or group exceeds its weighted share of usage, its priority decreases, but its jobs may still occupy resources as long as no higher-priority jobs are pending. The condor_userprio.desy/gpu commands show current allocations (see Quotas and Priorities).

  2. Surplus Quota Exploitation: "Standard jobs" (1 core, 2 GB RAM, 3 h runtime) can bypass group quotas by using surplus resources. If many such jobs are running, they may temporarily reduce the number of available slots (see Compute Resources).

  3. Migration to EL9: The ongoing EL9 migration (since July 2024) has reduced the total pool size. Some slots may be offline for upgrades, and EL9-specific issues (e.g., memory monitoring) could affect slot availability (see Migration Timeline).
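
To make the "standard job" definition in point 2 concrete, the following is a minimal Python sketch that checks whether a resource request stays within the three quoted limits (1 core, 2 GB RAM, 3 h runtime). The `JobRequest` class and both function names are hypothetical, and treating the quoted values as upper bounds is an assumption of this sketch, not a statement of how the NAF scheduler actually decides.

```python
from dataclasses import dataclass
from typing import Iterable

# Limits quoted in the text for a "standard job". Treating them as upper
# bounds is an assumption of this sketch.
MAX_CORES = 1
MAX_MEMORY_GB = 2.0
MAX_RUNTIME_HOURS = 3.0


@dataclass
class JobRequest:
    """Hypothetical container for the resources a job asks for."""
    cores: int
    memory_gb: float
    runtime_hours: float


def is_standard_job(job: JobRequest) -> bool:
    """Return True if the request stays within all three quoted limits."""
    return (
        job.cores <= MAX_CORES
        and job.memory_gb <= MAX_MEMORY_GB
        and job.runtime_hours <= MAX_RUNTIME_HOURS
    )


def count_standard_jobs(jobs: Iterable[JobRequest]) -> int:
    """Count the requests that could draw on surplus resources."""
    return sum(1 for job in jobs if is_standard_job(job))
```

Counting such requests in a user's workload gives a rough idea of how many jobs could run outside the group quota, which is one way a single user might come to hold several thousand cores.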

Recommended Actions

  1. Check Current Usage: Run condor_status -avail to confirm slot availability. Use condor_userprio.desy to identify high-usage users/groups.
  2. Review Daily Statistics: The BIRD-day.html page tracks slot trends. A sudden drop may point to a single user’s jobs, while a gradual decline more likely reflects migration or maintenance.
  3. Contact Support: If the issue persists, open a ticket via naf-helpdesk@desy.de with:
    • Output of condor_q -global (for job distribution).
    • Screenshots of the daily statistics showing the decline.
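
The distinction drawn above between a sudden drop and a gradual decline in the daily statistics can be sketched in Python as follows. This is an illustration only: the function name, the input format (a plain list of daily healthy-slot counts, oldest first), and the 25 % threshold are all assumptions and do not come from the BIRD-day.html page or any NAF tooling.

```python
def classify_slot_trend(daily_slots, sudden_fraction=0.25):
    """Label a series of daily healthy-slot counts (oldest first).

    sudden_fraction is a hypothetical threshold: a single day-over-day loss
    larger than this fraction of the previous day's count is called sudden.
    """
    if len(daily_slots) < 2:
        return "insufficient data"
    # No net loss between the first and last day of the series.
    if daily_slots[-1] >= daily_slots[0]:
        return "no decline"
    # Look for any single day-over-day loss that exceeds the threshold.
    for previous, current in zip(daily_slots, daily_slots[1:]):
        if previous > 0 and (previous - current) / previous > sudden_fraction:
            return "sudden drop"
    return "gradual decline"


# Invented example counts, not real NAF statistics.
print(classify_slot_trend([7000, 6900, 3900, 3850]))
print(classify_slot_trend([7000, 6500, 6000, 5500, 5000]))
```

With these invented series, the first call reports a sudden drop and the second a gradual decline. A real check would need counts read from the daily statistics and a threshold agreed with the administrators.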

Sources Used

  1. Quotas and Priorities
  2. Compute Resources & Background
  3. Migration to EL9
  4. Daily Statistics (BIRD-day.html)