1592122 : Files on pnfs not accessible from condor workers

Created: 2026-04-14T08:47:32Z - current status: new

Anonymized Summary: A user from the [EXPERIMENT_GROUP] collaboration reports intermittent issues accessing files on pnfs (specifically in /pnfs/desy.de/[EXPERIMENT_GROUP]/tier2/store/...) from HTCondor workers. The problem:

  - Does not occur when accessing files locally from NAF login nodes.
  - Only affects jobs submitted via HTCondor, causing analysis code to fail.
  - Is also observed by other members of the same experiment group.
  - Began at about the time of the scheduled downtime on April 8.


Suggested Solution/Next Steps:

  1. Verify Kerberos Authentication on Workers
     The issue may stem from expired or missing Kerberos tickets on Condor worker nodes. Users should:
     - Check that a valid Kerberos ticket exists when jobs are submitted (run kinit before submission).
     - Ensure the job script renews credentials if needed (e.g., via aklog or krenew).
     - List current tickets with: klist
     - If no valid ticket is listed, request a new one with: kinit
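The ticket check above could also run at the start of the job itself, so that the job's output shows what the worker node sees. The following is a minimal sketch, not a tested NAF recipe. It assumes that klist on the workers supports the -s option as in MIT Kerberos (silent, with exit status 0 only when the credentials cache is readable and not expired). The function name check_ticket is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: report whether this environment holds a usable Kerberos ticket.
# Assumption: `klist -s` behaves as in MIT Kerberos (no output, exit status 0
# only if the credentials cache can be read and has not expired).

check_ticket() {
    if klist -s 2>/dev/null; then
        echo "ticket: valid"
    else
        echo "ticket: missing or expired"
        return 1
    fi
}

# Print the node name so a failure can be matched to a specific worker.
echo "node: $(uname -n)"
check_ticket || echo "hint: run kinit before submitting, or renew credentials inside the job"
```

If this reports a missing ticket on the worker while klist on the login node shows a valid one, the credentials are probably not reaching the job, which would be worth mentioning when contacting support.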

  2. Check for Experiment-Specific Infrastructure Issues
     Since the problem is observed by multiple users in the [EXPERIMENT_GROUP] group, it may be related to:
     - dCache/PNFS access permissions for Condor workers.
     - Network or storage backend issues post-downtime.
     Action: Contact the dedicated [EXPERIMENT_GROUP] support line: naf-[EXPERIMENT_GROUP]-support@desy.de (Replace [EXPERIMENT_GROUP] with the actual experiment name.)

  3. Debugging Steps for Condor Jobs
     - Log Files: Check the Condor job logs and job ClassAd attributes for errors (e.g., HoldReason or MemoryUsage):
         condor_q -held [USERNAME]
         condor_q [JOB_ID] -af HoldReason
     - Test Job: Submit a minimal job to isolate the issue (e.g., a script that only lists files in /pnfs/...).
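The minimal test job described above could use a script along these lines as its executable. This is only a sketch: the function name probe_path is hypothetical, and the directory to test is passed as the first argument because the exact path under /pnfs/... is specific to the affected data.

```shell
#!/usr/bin/env bash
# Sketch of a minimal probe: try to list one directory and report the result.
# The directory is taken from the first argument; no default is set here,
# since the affected /pnfs path differs from case to case.

probe_path() {
    if ls "$1" >/dev/null 2>&1; then
        echo "OK   $1"
    else
        echo "FAIL $1"
        return 1
    fi
}

# Only probe when a path was actually supplied on the command line.
if [ "$#" -gt 0 ]; then
    # Node name and UTC timestamp help correlate failures with specific workers.
    echo "node: $(uname -n)  time: $(date -u +%Y-%m-%dT%H:%M:%SZ)"
    probe_path "$1"
fi
```

Submitting several copies of such a job and comparing which nodes report FAIL may show whether the problem is tied to particular workers or occurs everywhere.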

  4. Temporary Workaround
     If the issue is intermittent, retry failed jobs or use local scratch space (/scratch or /tmp) as a fallback.
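Where resubmitting whole jobs is too costly, the retry could instead happen inside the job script, around the individual file access. A minimal sketch follows. The helper name retry and the attempt count and delay shown in the usage comment are illustrative assumptions, not recommended values.

```shell
#!/usr/bin/env bash
# Sketch: run a command up to a fixed number of times, pausing between tries.
# Usage: retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]

retry() {
    local attempts="$1" delay="$2"
    shift 2
    local n=1
    # Keep trying until the command succeeds or the attempt limit is reached.
    while ! "$@"; do
        if [ "$n" -ge "$attempts" ]; then
            echo "giving up after $n attempts: $*" >&2
            return 1
        fi
        n=$((n + 1))
        sleep "$delay"
    done
}

# Example (illustrative values): try to list the directory up to 3 times,
# waiting 30 seconds between attempts.
# retry 3 30 ls /pnfs/...
```

This only hides an intermittent failure and does not address its cause, so it is best treated as a stopgap until the underlying issue is resolved.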


Sources:

  1. NAF Support Contacts (Experiment-specific support lines).
  2. Condor Submit Errors (KRB Tickets) (Kerberos authentication for Condor jobs).
  3. Job Requirements and Failures (Debugging Condor job holds).