SLURM
Slurm (Simple Linux Utility for Resource Management) is an open-source workload manager and job scheduler designed for high-performance computing clusters. It is widely used in research, academia, and industry to efficiently manage and allocate computing resources such as CPUs, GPUs, memory, and storage for running various types of jobs and tasks. Slurm helps optimize resource utilization, minimizes job conflicts, and provides a flexible framework for distributing workloads across a cluster of machines. It offers features like job prioritization, fair sharing of resources, job dependencies, and real-time monitoring, making it an essential tool for orchestrating complex computational workflows in diverse fields.
Availability¶
Software | Module Load Command |
---|---|
slurm | module load wulver |
Please note that the `wulver` module is already loaded when a user logs in to the cluster. If you use the `module purge` command, make sure to add `module load wulver` in your SLURM script to load SLURM.
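For example, if your job script clears its environment with `module purge`, re-load `wulver` before loading anything else. A minimal sketch (the job name and any additional modules are placeholders):

```bash
#!/bin/bash -l
#SBATCH --job-name=example     # other #SBATCH directives omitted for brevity

module purge           # removes all currently loaded modules, including wulver
module load wulver     # restore wulver so SLURM and the cluster environment work
# module load <any other modules your application needs>
```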
Application Information, Documentation¶
The documentation for SLURM is available in the SLURM manual.
Managing and Monitoring Jobs¶
SLURM has numerous tools for monitoring jobs. Below are a few to get started. More documentation is available on the SLURM website.
The most common commands are:
- List all current jobs:
squeue
- Job deletion:
scancel [job_id]
- Run a job:
sbatch [submit script]
- Run a command:
srun <slurm options> <command name>
SLURM User Commands¶
Task | Command |
---|---|
Job submission: | sbatch [script_file] |
Job deletion: | scancel [job_id] |
Job status by job: | squeue [job_id] |
Job status by user: | squeue -u [user_name] |
Job hold: | scontrol hold [job_id] |
Job release: | scontrol release [job_id] |
List enqueued jobs: | squeue |
List nodes: | sinfo -N OR scontrol show nodes |
Cluster status: | sinfo |
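As a quick illustration of these commands, a typical session might look like the following (the script name and job ID are placeholders; job ID 635 matches the `squeue` example later on this page):

```bash
sbatch submit.sh       # submit a batch job; SLURM replies with "Submitted batch job <job_id>"
squeue -u $LOGNAME     # list your own pending and running jobs
scontrol hold 635      # place a pending job on hold
scontrol release 635   # release the held job so it can be scheduled
scancel 635            # cancel the job
sinfo                  # show the state of the cluster's partitions and nodes
```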
Using SLURM on Wulver¶
On Wulver, SLURM submissions have new requirements, intended to share resources more fairly without impinging on investor/owner rights to computational resources. All jobs must now be charged to a PI (Principal Investigator) group account.
Account (Use `--account`)¶
To specify the account for a job, use `--account=$PI_ucid`. You can specify `--account` either as an `sbatch` command-line option or as an `#SBATCH` parameter. If you don't know the PI's UCID, use the `quota_info` command and you will see the SLURM account you are associated with. See Check Quota below for details.
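For example, either of the following works (here `xy1234` is a placeholder PI UCID, matching the `quota_info` example later on this page):

```bash
# As a directive inside the job script:
#SBATCH --account=xy1234

# Or on the command line at submission time, which overrides any value in the script:
sbatch --account=xy1234 submit.sh
```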
Partition (Use `--partition`)¶
Wulver has the following partitions, which differ in the CPUs, GPUs, and memory available:
Partition | Nodes | Cores/Node | CPU | GPU | Memory | Service Unit (SU) Charge |
---|---|---|---|---|---|---|
--partition=general | 100 | 128 | 2.5 GHz AMD EPYC 7763 (2) | NA | 512 GB | 1 SU per CPU hour |
--partition=debug | 1 | 4 | 2.5 GHz AMD EPYC 7763 (2) | NA | 512 GB | No charge; must be used with --qos=debug |
--partition=gpu | 25 | 128 | 2.0 GHz AMD EPYC 7713 (2) | NVIDIA A100 GPUs (4) | 512 GB | 3 SU per hour per GPU node |
--partition=bigmem | 2 | 128 | 2.5 GHz AMD EPYC 7763 (2) | NA | 2 TB | 1.5 SU per CPU hour |
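You can check these partition properties on the live system with `sinfo`; a small sketch using standard `sinfo` format options:

```bash
# For each partition: name, node count, CPUs per node, memory per node (MB), and GPUs (GRES)
sinfo -p general,debug,gpu,bigmem -o "%P %D %c %m %G"
```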
Priority (Use `--qos`)¶
Wulver has three levels of “priority”, utilized under SLURM as Quality of Service (QoS):
QoS | Purpose | Rules | Wall time limit (hours) | Valid Users |
---|---|---|---|---|
--qos=standard | Normal jobs. Faculty PIs are allocated 300,000 Service Units (SU) per year. | SU charges based on node type (see the partition table above); jobs can be preempted by enqueued high-QoS jobs. | 72 | Everyone |
--qos=low | Free access, no SU charge. | Jobs can be preempted by enqueued high- or standard-QoS jobs. | 72 | Everyone |
--qos=high_$PI | Replace $PI with the UCID of the PI; only available to owners/investors. | Highest-priority jobs, no SU charges. | 72 | Owner/investor PI groups |
--qos=debug | Intended for debugging and testing jobs. | No SU charges; a maximum of 4 CPUs is allowed; must be used with --partition=debug. | 8 | Everyone |
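Putting the account, partition, and QoS options together, the following directives sketch three common combinations (`xy1234` is a placeholder PI UCID):

```bash
# Standard-priority CPU job, charged against the PI group's SU allocation
#SBATCH --account=xy1234 --partition=general --qos=standard

# Free CPU job that may be preempted by standard- or high-QoS jobs
#SBATCH --account=xy1234 --partition=general --qos=low

# Short test job: no SU charge, at most 4 CPUs, 8-hour wall time limit
#SBATCH --account=xy1234 --partition=debug --qos=debug
```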
Check Quota¶
Faculty PIs are allocated 300,000 Service Units (SU) per year upon request at no cost, which can be used via `--qos=standard` in SLURM jobs. It's important to check SU usage regularly so that users are aware of their consumption and can switch to `--qos=low` to avoid exhausting the allocation. Users can check their quota with the `quota_info UCID` command.
In the sample output of `quota_info xy1234`, `xy1234` represents the UCID of the PI, and the line "SLURM Service Units (CPU Hours): 277557 (300000 Quota)" indicates that members of the PI group have already used 277,557 CPU hours out of the allocated 300,000 SUs, of which the user `xy1234` used 1,703 CPU hours. The command also displays the storage usage of directories such as `$HOME`, `/project`, and `/scratch`, showing both the group usage and the individual usage for each. In this example, the group has used 867 GB of the 2 TB project quota, with the user's share being 11 GB of that 867 GB. For more details on file system quotas, see Wulver Filesystem.
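For instance, if `quota_info` shows the group allocation is nearly exhausted, you can keep the same job script and simply override the QoS at submission time, since command-line options take precedence over `#SBATCH` directives:

```bash
quota_info xy1234            # xy1234 is a placeholder PI UCID
sbatch --qos=low submit.sh   # run without charging SUs; the job may be preempted
```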
Example of slurm script¶
Submitting Jobs on CPU Nodes¶
Sample Job Script to use: submit.sh
#!/bin/bash -l
# Serial job: a single task on one core
#SBATCH --job-name=job_name
#SBATCH --output=%x.%j.out # %x.%j expands to slurm JobName.JobID
#SBATCH --error=%x.%j.err
#SBATCH --partition=general
#SBATCH --qos=standard
#SBATCH --account=$PI_ucid # Replace $PI_ucid with the NJIT UCID of the PI
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=59:00 # D-HH:MM:SS
#SBATCH --mem-per-cpu=4000M
./myexe <input/output options> # myexe is the executable in this example.
#!/bin/bash -l
# MPI-style parallel job: 8 tasks on one node, launched with srun
#SBATCH --job-name=job_name
#SBATCH --output=%x.%j.out # %x.%j expands to slurm JobName.JobID
#SBATCH --error=%x.%j.err
#SBATCH --partition=general
#SBATCH --qos=standard
#SBATCH --account=$PI_ucid # Replace $PI_ucid with the NJIT UCID of the PI
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
#SBATCH --time=59:00 # D-HH:MM:SS
#SBATCH --mem-per-cpu=4000M
srun ./myexe <input/output options> # myexe is the executable in this example.
#!/bin/bash -l
# Multithreaded (OpenMP) job: 8 CPUs on one node, thread count set via OMP_NUM_THREADS
#SBATCH --job-name=job_name
#SBATCH --output=%x.%j.out # %x.%j expands to slurm JobName.JobID
#SBATCH --error=%x.%j.err
#SBATCH --partition=general
#SBATCH --qos=standard
#SBATCH --account=$PI_ucid # Replace $PI_ucid with the NJIT UCID of the PI
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
#SBATCH --time=59:00 # D-HH:MM:SS
#SBATCH --mem-per-cpu=4000M
OMP_NUM_THREADS=$SLURM_NTASKS ./myexe <input/output options>
#!/bin/bash -l
# Hybrid MPI + OpenMP job (GROMACS example): 64 tasks with 2 CPUs per task
#SBATCH --job-name=job_name
#SBATCH --output=%x.%j.out # %x.%j expands to slurm JobName.JobID
#SBATCH --error=%x.%j.err
#SBATCH --partition=general
#SBATCH --qos=standard
#SBATCH --account=$PI_ucid # Replace $PI_ucid with the NJIT UCID of the PI
#SBATCH --nodes=1
#SBATCH --ntasks=64
#SBATCH --cpus-per-task=2
#SBATCH --time=59:00 # D-HH:MM:SS
#SBATCH --mem-per-cpu=4000M
srun gmx_mpi mdrun ... -ntomp $SLURM_CPUS_PER_TASK ...
Warning
Do not request multiple cores unless your code is parallelized. Before using multiple cores, ensure that your code is capable of parallelizing tasks; otherwise, it will unnecessarily consume service units (SUs) and may negatively impact performance. Please review the code's documentation thoroughly and use a single core if it does not support parallel execution.
- Here, the job requests 1 node on the `general` partition with `qos=standard`. Please note that the memory granted depends on the number of cores you request.
- As per the policy, users can request up to 4 GB of memory per core, which is why the `--mem-per-cpu` flag is used for the memory requirement. If you are using 1 core and need more memory, use `--mem` instead.
- In the script above, `--time` indicates the wall time, which specifies the maximum amount of time the job is allowed to run. The maximum allowable wall time depends on the SLURM QoS; see the QoS table above.
- To submit the job, use `sbatch submit.sh`, where `submit.sh` is the job script. Once submitted, the job enters the queue and is executed according to priority-based scheduling.
- To check the status of the job, use `squeue -u $LOGNAME`; you should see something like the following:

      JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
        635   general job_name     ucid  R    00:02:19      1 n0088

  Here, `ST` stands for the status of the job. You may see the status shown as `PD`, which means the job is pending and has not yet been assigned to a node; how long that takes depends on the number of users on the partition and on the resources the job requests. Once the job starts, you will see the output file with a `.out` extension. If the job produces any errors, you can check the details in the file with the `.err` extension.
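`squeue` only shows pending and running jobs. Once a job has finished, one hedged way to review it, assuming SLURM's accounting database is enabled (which the SU charging above suggests), is `sacct`:

```bash
# Summarize a finished job; 635 is the placeholder job ID from the squeue example above
sacct -j 635 --format=JobID,JobName,Partition,State,Elapsed,MaxRSS
```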
Submitting Jobs on GPU Nodes¶
To submit jobs on GPU nodes, you can use one of the following SLURM scripts.
Sample Job Script to use: gpu_submit.sh
#!/bin/bash -l
# GPU job: 1 CPU task and 1 GPU on a single node
#SBATCH --job-name=gpu_job
#SBATCH --output=%x.%j.out # %x.%j expands to slurm JobName.JobID
#SBATCH --error=%x.%j.err
#SBATCH --partition=gpu
#SBATCH --qos=standard
#SBATCH --account=$PI_ucid # Replace $PI_ucid with the NJIT UCID of the PI
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:1
#SBATCH --time=59:00 # D-HH:MM:SS
#SBATCH --mem-per-cpu=4000M
./myexe <input/output options> # myexe is the executable in this example.
#!/bin/bash -l
# GPU job: 8 CPU tasks (launched with srun) and 1 GPU
#SBATCH --job-name=gpu_job
#SBATCH --output=%x.%j.out # %x.%j expands to slurm JobName.JobID
#SBATCH --error=%x.%j.err
#SBATCH --partition=gpu
#SBATCH --qos=standard
#SBATCH --account=$PI_ucid # Replace $PI_ucid with the NJIT UCID of the PI
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
#SBATCH --gres=gpu:1
#SBATCH --time=59:00 # D-HH:MM:SS
#SBATCH --mem-per-cpu=4000M
srun ./myexe <input/output options> # myexe is the executable in this example.
#!/bin/bash -l
# GPU job: 8 CPU tasks and 2 GPUs on a single node
#SBATCH --job-name=gpu_job
#SBATCH --output=%x.%j.out # %x.%j expands to slurm JobName.JobID
#SBATCH --error=%x.%j.err
#SBATCH --partition=gpu
#SBATCH --qos=standard
#SBATCH --account=$PI_ucid # Replace $PI_ucid with the NJIT UCID of the PI
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
#SBATCH --gres=gpu:2
#SBATCH --time=59:00 # D-HH:MM:SS
#SBATCH --mem-per-cpu=4000M
srun ./myexe <input/output options> # myexe is the executable in this example.
Warning
Do not use multiple GPUs unless you are certain that your job's performance will benefit from them. Most GPU jobs do not require multiple CPUs either. Please remember that unnecessarily requesting additional resources can negatively impact job performance and will also consume more service units (SUs).
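Inside a GPU job, SLURM limits you to the GPU(s) you requested with `--gres`. A quick hedged check you can drop into a job script (`CUDA_VISIBLE_DEVICES` is commonly set by SLURM for GPU allocations):

```bash
echo "Allocated GPUs: $CUDA_VISIBLE_DEVICES"   # typically set by SLURM to the granted GPU IDs
nvidia-smi                                     # should list only the GPU(s) you requested
```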
Submitting Jobs on `debug`¶
The `debug` QoS in Slurm is intended for debugging and testing jobs. It usually provides a shorter queue wait time and quicker job turnaround. Jobs submitted with the `debug` QoS have access to a limited set of resources (only 4 CPUs on Wulver), making it suitable for rapid testing and debugging of applications without tying up cluster resources for extended periods.
Sample Job Script to use: debug_submit.sh
#!/bin/bash -l
#SBATCH --job-name=debug
#SBATCH --output=%x.%j.out # %x.%j expands to slurm JobName.JobID
#SBATCH --error=%x.%j.err
#SBATCH --partition=debug
#SBATCH --qos=debug
#SBATCH --account=$PI_ucid # Replace $PI_ucid with the NJIT UCID of the PI
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --time=7:59:00 # D-HH:MM:SS, Maximum allowable Wall Time 8 hours
#SBATCH --mem-per-cpu=4000M
./myexe <input/output options>
To submit the job, use the `sbatch` command: `sbatch debug_submit.sh`.
Interactive session on a compute node¶
Interactive sessions are useful for tasks that require direct interaction with the compute node's resources and software environment. To start an interactive session on a compute node, use the `interactive` command after logging into Wulver.
The `interactive` Command¶
We provide a built-in shortcut command, `interactive`, that allows you to quickly and easily request a session on a compute node. The `interactive` command acts as a convenient wrapper for Slurm's `salloc` command. Similar to `sbatch`, which is used for batch jobs, `salloc` is specifically designed for interactive jobs.
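In plain Slurm terms, the defaults described in the help text below correspond roughly to a `salloc` request like the following (a sketch only; on Wulver, prefer the `interactive` wrapper itself):

```bash
# Approximate salloc equivalent of: interactive -a $PI_UCID -q standard -j cpu
salloc --account=$PI_UCID --qos=standard --partition=general --ntasks=1 --time=01:00:00
```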
$ interactive -h
Usage: interactive -a ACCOUNT -q QOS -j JOB_TYPE
Starts an interactive SLURM job with the required account and QoS settings.
Required options:
-a ACCOUNT Specify the account to use.
-q QOS Specify the quality of service (QoS).
-j JOB_TYPE Specify the type of job: 'cpu' for CPU jobs or 'gpu' for GPU jobs.
Example: Run an interactive GPU job with the 'test' account and 'test' QoS:
/apps/site/bin/interactive -a test -q test -j gpu
This will launch an interactive job on the 'gpu' partition with the 'test' account and QoS 'test',
using 1 GPU, 1 CPU, and a walltime of 1 hour by default.
Optional parameters to modify resources:
-n NTASKS Specify the number of CPU tasks (Default: 1).
-t WALLTIME Specify the walltime in hours (Default: 1).
-g GPU Specify the number of GPUs (Only for GPU jobs, Default: 1).
-p PARTITION Specify the SLURM partition (Default: 'general' for CPU jobs, 'gpu' for GPU jobs).
Use '-h' to display this help message.
$ interactive -a $PI_UCID -q standard -j cpu
Starting an interactive session with the general partition and 1 core for 01:00:00 of walltime in standard priority
salloc: Pending job allocation 466577
salloc: job 466577 queued and waiting for resources
salloc: job 466577 has been allocated resources
salloc: Granted job allocation 466577
salloc: Nodes n0103 are ready for job
You can then run your command on the allocated node with `srun`:
srun ./myexe <input/output options>
or log in to the node with `ssh`:
ssh n0103
$ interactive -a $PI_UCID -q standard -j gpu
Starting an interactive session with the GPU partition, 1 core and 1 GPU for 01:00:00 of walltime in standard priority
salloc: Pending job allocation 466579
salloc: job 466579 queued and waiting for resources
salloc: job 466579 has been allocated resources
salloc: Granted job allocation 466579
salloc: Nodes n0048 are ready for job
You can then run your command on the allocated node with `srun`:
srun ./myexe <input/output options>
or log in to the node with `ssh`:
ssh n0048
$ interactive -a $PI_UCID -q debug -j cpu -p debug
Starting an interactive session with the debug partition and 1 core for 01:00:00 of walltime in debug priority
salloc: Pending job allocation 466581
salloc: job 466581 queued and waiting for resources
salloc: job 466581 has been allocated resources
salloc: Granted job allocation 466581
salloc: Waiting for resource configuration
salloc: Nodes n0127 are ready for job
You can then run your command on the allocated node with `srun`:
srun ./myexe <input/output options>
or log in to the node with `ssh`:
ssh n0127
Replace `$PI_UCID` with the PI's NJIT UCID.
Now, once you get confirmation of the job allocation, you can use either `srun` or `ssh` to access the node allocated to the job.
Customizing Your Resources¶
Please note that, by default, an interactive session requests 1 core (for CPU jobs) or 1 GPU (for GPU jobs), with a 1-hour walltime. To customize the resources, run `interactive -h` for help. Each flag is explained below.
Flag | Explanation | Example |
---|---|---|
-a | Mandatory. This is followed by your group's name; use quota_info to check the account/group name. | -a $PI_UCID |
-q | Mandatory. Used to select the priority (QoS). | -q standard |
-j | Mandatory. Specify whether you want a CPU or GPU node. | -j cpu |
-n | Optional. The total number of CPUs. Default is 1 core unless specified. | -n 1 |
-t | Optional. The amount of walltime to reserve for your job, in hours. Default is 1 hour unless specified. | -t 1 |
-g | Optional. The total number of GPUs. Default is 1 GPU unless specified. | -g 1 |
-p | Optional. Specify the name of the partition (default: general for CPU jobs, gpu for GPU jobs). | -p debug |
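For example, combining the optional flags, the following command would request a two-hour interactive session with 2 GPUs and 4 CPU tasks on the gpu partition (only request the extra CPUs if your code can actually use them; see the note below):

```bash
interactive -a $PI_UCID -q standard -j gpu -g 2 -n 4 -t 2
```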
Warning
Login nodes are not designed for running computationally intensive jobs. You can use the login node to edit and manage your files or to run small-scale interactive tasks; CPU usage per user is limited on the login node. For serious computing, either submit a job with the `sbatch` command or start an interactive session on a compute node.
Note
Please note that if you are using GPUs, check whether your script is parallelized. If your script is not parallelized and relies only on the GPU, you do not need to request more cores per node; in that case, do not use `-n` when running the `interactive` command, as the default will request 1 CPU per GPU. Keep in mind that using multiple cores on GPU nodes may result in unnecessary CPU-hour charges. Following this practice also makes service-unit accounting significantly easier.