1598705 : maxwell cluster sh file error
Created: 2026-05-06T08:28:26Z - current status: new
Anonymized Summary:
A user reports that jobs that previously ran successfully on the Maxwell cluster (~10 months ago) now fail when using the same input files. The issue appears to be related to the job submission script (.sh file). The user suspects recent changes to the cluster (e.g., system configuration, environment, or job submission setup) may be the cause.
A test directory (/data/[GROUP]/[SUBDIRECTORY]/test_for_[PURPOSE]) has been created to investigate the issue, containing multiple test cases.
Possible Causes & Next Steps:
1. Environment/Module Changes:
   - The Maxwell cluster regularly updates software modules and system configurations. The `comsyl` module or its dependencies (e.g., MPI, Python, UCX) may have been updated, causing incompatibilities.
   - Action: Check the job output/error logs for specific error messages (e.g., missing libraries, module load failures). Compare the current environment (`module list`) with the environment used 10 months ago (if records exist).
2. Job Script Syntax:
   - The sample batch script in the Maxwell documentation (see Sources) includes directives like `unset LD_PRELOAD` and `--mca pml ucx`. If these were omitted or modified in the user’s script, jobs may fail.
   - Action: Review the `.sh` script for deviations from the sample script. Ensure all required directives (e.g., `module load maxwell comsyl`) are present.
3. Spack/Software Stack:
   - If the user’s workflow relies on Spack-installed software, updates to Spack or its packages could introduce breaking changes.
   - Action: Verify whether the `comsyl` module or its dependencies were installed via Spack. If so, check for version mismatches or rebuild the environment.
4. Partition/Resource Limits:
   - Changes to partition configurations (e.g., time limits, node requirements) might affect job submission.
   - Action: Confirm that the `--partition` and `--time` directives in the script match current cluster policies.
5. Test Case Analysis:
   - The test directory (`test_for_[PURPOSE]`) likely contains minimal examples to reproduce the issue. Submitting one of these with `sbatch --verbose` shows how Slurm interprets the submission, and the job's output/error files can then help pinpoint the failure stage (e.g., module loading, MPI initialization).
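The Job Script Syntax check can be partly automated. The following is a minimal sketch and not part of the Maxwell documentation: `check_directives` is a hypothetical helper, and the three strings it searches for are only the examples named in this ticket, so they should be replaced with whatever the current sample batch script actually contains.

```bash
# check_directives SCRIPT
# Report which expected strings are present in or missing from a job script.
# The strings below are assumptions taken from the ticket text, not a verified list.
check_directives() {
    script="$1"
    missing=0
    for pattern in 'unset LD_PRELOAD' '--mca pml ucx' 'module load maxwell comsyl'; do
        # -F: match as a fixed string; -q: no output; -e: allow patterns starting with "-"
        if grep -qF -e "$pattern" "$script"; then
            echo "found:   $pattern"
        else
            echo "MISSING: $pattern"
            missing=1
        fi
    done
    # Non-zero exit status if at least one string was not found
    return "$missing"
}
```

Calling `check_directives job.sh` (with `job.sh` standing in for the user's script) prints one line per string and returns a non-zero status if any is missing. A match only shows that the text occurs somewhere in the file, so commented-out lines still need a manual look.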
Suggested Solution:
1. Debugging Steps:
   - Run a test job with the following modifications to the script:
     ```bash
     # Write stdout and stderr to separate per-job files (%j expands to the job ID)
     #SBATCH --output=debug_%j.out
     #SBATCH --error=debug_%j.err
     # Load modules explicitly
     module purge
     module load maxwell comsyl
     # Test basic functionality (e.g., run a minimal command)
     echo "Testing module environment..."
     module list
     which python
     ```
   - Check the output/error files for clues (e.g., missing paths, library errors).
2. Compare Environments:
   - If possible, retrieve the environment from the original successful run (e.g., from old job logs) and compare it to the current environment.
3. Consult Documentation:
   - Review recent Maxwell cluster updates (e.g., blog posts) or user meeting notes for changes affecting `comsyl` or MPI.
4. Contact Support:
   - If the issue persists, share the anonymized error logs and script with the Maxwell support team for further analysis.
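For the Compare Environments step, one possible approach is to dump the environment inside the test job and compare it line by line against a dump recovered from an old run. This sketch assumes such an old dump exists as a text file sorted the same way, which the ticket does not confirm. `snapshot_env` and `diff_env` are hypothetical helper names, and variables with multi-line values will not compare cleanly.

```bash
# snapshot_env FILE
# Write the current environment, sorted in the C locale, to FILE.
snapshot_env() {
    env | LC_ALL=C sort > "$1"
}

# diff_env OLD NEW
# List lines that differ between two sorted snapshots.
# Lines only in OLD are prefixed with "- ", lines only in NEW with "+ ".
diff_env() {
    # comm needs both inputs sorted in the same collation order (C locale here)
    LC_ALL=C comm -23 "$1" "$2" | sed 's/^/- /'
    LC_ALL=C comm -13 "$1" "$2" | sed 's/^/+ /'
}
```

Inside the test job, calling `snapshot_env` after `module load maxwell comsyl` captures the state the job actually runs with. Changed values of variables such as `PATH` or `LD_LIBRARY_PATH` then appear as a `-`/`+` pair in the `diff_env` output.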
Sources:
- Sample batch script for comsyl
- Maxwell blog: Recent updates