1594810 : HTCondor held jobs stuck - schedd SECMAN:2007 error
Created: 2026-04-22T10:26:04Z - current status: new
Anonymized Summary:
A user reports four jobs stuck in the hold state on the HTCondor scheduler [SCHEDULER_HOST] (e.g., bird-htc-sched22.desy.de). The user cannot remove or otherwise manage these jobs because every condor_* command fails with the persistent error "SECMAN:2007: Failed to end classad message". The user confirms that their Kerberos authentication is valid but cannot proceed further.
Job IDs (anonymized):
- [JOB_ID_1]
- [JOB_ID_2]
- [JOB_ID_3]
- [JOB_ID_4]
Core Issue:
- Jobs in Hold State: The jobs are stuck in "hold" and cannot be released or removed via the standard commands (condor_release, condor_rm).
- Scheduler Overload: The SECMAN:2007 error suggests the scheduler is unresponsive, likely due to:
  - A high volume of faulty job submissions (e.g., incorrect paths, logging issues).
  - Excessive polling of the scheduler (e.g., via condor_q in scripts or watch).
  - A denial-of-service-like state where the scheduler is overwhelmed.
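
To illustrate the polling concern above, here is a minimal sketch of a monitoring loop that waits longer after each failed query instead of calling condor_q in a tight loop or under watch. The function names (backoff_intervals, query_held_jobs, poll_with_backoff) and the interval values are hypothetical choices for illustration, not NAF recommendations. While the scheduler is already unresponsive, the better option is still to avoid polling altogether.

```python
import subprocess
import time


def backoff_intervals(start=60.0, factor=2.0, cap=900.0):
    """Yield wait times in seconds that grow geometrically up to a cap."""
    wait = start
    while True:
        yield min(wait, cap)
        wait = min(wait * factor, cap)


def query_held_jobs():
    """Run `condor_q -hold` once; return its output, or None on any failure."""
    try:
        result = subprocess.run(
            ["condor_q", "-hold"],
            capture_output=True, text=True, timeout=120,
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    return result.stdout if result.returncode == 0 else None


def poll_with_backoff(query=query_held_jobs, max_attempts=5,
                      sleep=time.sleep, intervals=None):
    """Call `query` until it succeeds, waiting longer after each failure.

    Returns the query output, or None if every attempt failed.
    """
    intervals = intervals if intervals is not None else backoff_intervals()
    for attempt in range(max_attempts):
        output = query()
        if output is not None:
            return output
        # Do not sleep after the final attempt; just give up.
        if attempt < max_attempts - 1:
            sleep(next(intervals))
    return None
```

With the default values, a script would wait 60 s, 120 s, 240 s and 480 s between its five attempts, so a struggling scheduler sees at most five queries in about 15 minutes rather than one every few seconds.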
Suggested Solution/Next Steps:
- Immediate Workaround:
  - Avoid further condor_q commands to reduce scheduler load.
  - Contact NAF/HTCondor admins to:
    - Manually remove the stuck jobs from the scheduler's queue.
    - Check the scheduler's status (e.g., CPU/memory usage, log files) for underlying issues.
- Preventive Measures:
  - Review job submission scripts for typos in paths (executable/logging/output).
  - Limit job submissions until the scheduler recovers.
  - Use condor_q -hold (if the scheduler responds) to identify the hold reason for future jobs.
- If the Issue Persists:
  - The scheduler may need a restart (admin-only action).
  - Monitor the NAF documentation or status page for updates on scheduler health.
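
The path review suggested under Preventive Measures can be partly automated before submission. The following is a minimal sketch, not an official NAF or HTCondor tool: the helper names (parse_submit_file, check_submit_paths) are hypothetical, and it assumes a simple submit file with one key = value pair per line, that condor_submit is run from the directory containing the submit file, and that initialdir is not set. Paths whose directory part contains a $(...) macro are skipped rather than expanded.

```python
import os

# Submit-file keys whose values are file paths worth checking.
PATH_KEYS = ("executable", "log", "output", "error")


def parse_submit_file(path):
    """Return the simple `key = value` pairs of a submit file, keys lowercased."""
    settings = {}
    with open(path, encoding="utf-8") as handle:
        for raw in handle:
            line = raw.strip()
            # Skip blanks, comments, and statements without "=" (e.g. "queue 1").
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            settings[key.strip().lower()] = value.strip()
    return settings


def check_submit_paths(path):
    """Return a list of human-readable problems found in the submit file."""
    settings = parse_submit_file(path)
    # Assumption: relative paths are resolved against the submit file's directory.
    base = os.path.dirname(os.path.abspath(path))
    problems = []
    for key in PATH_KEYS:
        value = settings.get(key)
        if not value:
            continue
        full = os.path.join(base, value)
        if key == "executable":
            if "$(" in value:
                continue  # macros are not expanded in this sketch
            if not os.path.isfile(full):
                problems.append(f"executable not found: {value}")
            elif not os.access(full, os.X_OK):
                problems.append(f"executable lacks execute permission: {value}")
        else:
            if "$(" in os.path.dirname(value):
                continue  # directory depends on a macro; cannot check it here
            directory = os.path.dirname(full)
            if not os.path.isdir(directory):
                problems.append(f"{key} directory does not exist: {directory}")
    return problems
```

Calling check_submit_paths on a submit file returns an empty list when nothing suspicious is found. Any reported problem is worth fixing before submitting, since such mistakes are among the likely causes of held jobs listed above.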