
SLURM Commands

For a general introduction to using SLURM, watch the video tutorial that BYU put together. 

Here's a useful cheatsheet of many of the most common Slurm commands.

Example submission scripts are available in our Git repository:

https://bitbucket.org/caltechimss/central-hpc-public/src/master/slurm-scripts/

Job Submission

Use the Script Generator to check for syntax. Each #SBATCH line contains a parameter that you can use on the command-line (e.g. --time=1:00:00).

sbatch is used to submit batch (non-interactive) jobs. By default, the output is written to a file in the directory you submitted from: slurm-$SLURM_JOB_ID.out.

Most of your jobs will be submitted this way:

sbatch -A accounting_group your_batch_script
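
For illustration, here is a minimal sketch of a batch script (my_job and my_program are placeholder names to replace with your own):

#!/bin/bash
#SBATCH --time=1:00:00           # wallclock limit; the job is killed if it runs past this
#SBATCH --nodes=1                # run on a single node
#SBATCH --ntasks=1               # a single task
#SBATCH --output=my_job.%j.out   # %j expands to the job ID

./my_program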

salloc is used to obtain a job allocation that you can then run commands within interactively.

srun is used to obtain a job allocation (if needed) and execute an application. It can also be used to distribute MPI processes in your job.
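
For example, a sketch of interactive use with salloc, requesting one node for one hour and then running a command inside the allocation:

salloc -t 1:00:00 -N 1
srun hostname    # runs on the allocated compute node
exit             # leave the shell and release the allocation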

Environment Variables:

  • SLURM_JOB_ID - job ID
  • SLURM_SUBMIT_DIR - the directory you were in when sbatch was called
  • SLURM_CPUS_ON_NODE - how many CPU cores were allocated on this node
  • SLURM_JOB_NAME - the name given to the job
  • SLURM_JOB_NODELIST - the list of nodes assigned; potentially useful for distributing tasks
  • SLURM_JOB_NUMNODES - the number of nodes allocated to the job
  • SLURM_NPROCS - the total number of tasks in the job
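
As a sketch of how these variables can be used inside a batch script (my_program is a placeholder):

#!/bin/bash
#SBATCH -N 1
#SBATCH -n 4

cd $SLURM_SUBMIT_DIR                         # the directory sbatch was called from
echo "Job $SLURM_JOB_ID running on $SLURM_JOB_NODELIST"
export OMP_NUM_THREADS=$SLURM_CPUS_ON_NODE   # size threads to the allocated cores
./my_program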

Resource Requests

To run your job, you will need to specify what resources it needs: memory, cores, nodes, GPUs, etc. The scheduler is flexible enough to let you request exactly the resources you need.

  • --nodes               - the number of nodes (computers) for the job
  • --mem                 - the amount of memory per node that your job needs
  • -n                    - the total number of tasks your job requires
  • --gres=gpu:#          - the number of GPUs per node your job needs
  • --gres=gpu:type:#     - you can also specify the type of GPU; we have mostly P100s, but also 2 V100s
  • --qos                 - the QOS you want to run in, currently normal or debug
  • --mem-per-cpu=        - the amount of memory per CPU your job requires
  • -N                    - the minimum (and maximum) number of nodes required (short form of --nodes)
  • --ntasks-per-node=#   - tasks per node
  • --exclusive           - gives your job exclusive use of the node
  • --constraint=         - constrain the job to particular nodes; use skylake, cascadelake, or broadwell for particular processor types
Examples:

Request a single node with 2 P100 GPUs:

#SBATCH --nodes=1
#SBATCH --gres=gpu:2

Request a single node with 1 V100 GPU (either a 16GB or 32GB V100):

#SBATCH --nodes=1
#SBATCH --gres=gpu:v100:1

Request a single node with 1 V100 GPU, specifically a 32GB V100. (The four 32GB V100 GPUs are on a Cascade Lake node, so we need to constrain to that.)

#SBATCH --nodes=1
#SBATCH --gres=gpu:v100:1
#SBATCH --constraint="cascadelake"

Request that your job only runs on Skylake or Cascade Lake CPUs:

#SBATCH --constraint="skylake|cascadelake"
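
Putting several of these options together, a sketch of a complete GPU batch script might look like this (my_gpu_program is a placeholder, and gpu:p100:1 assumes the P100 type name used above):

#!/bin/bash
#SBATCH --time=2:00:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --mem=32G
#SBATCH --gres=gpu:p100:1
#SBATCH --qos=normal

./my_gpu_program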


Important Notes on Job Submission:

Your jobs must specify a wallclock time using the "-t" option when submitting. If this time is exceeded, your job will be killed. At first, it is best to set this to the maximum time allowed so you get an idea of how long the job actually runs; once you know that, give it a more realistic limit. Setting a reasonable time limit will increase your chance of starting quickly because of the backfill algorithm the scheduler uses.

Your job will be charged to the account specified. We do not force you to set an account, since many users are in just one. If you are in more than one group, make sure you specify the group you want to charge the job to. This is done with the "-A" option when submitting the job.

You can see the accounts you are in using:

sacctmgr show user myusername accounts

You can change your default account using:

sacctmgr modify user myusername set defaultaccount=account

Note: Please choose your job's wall time wisely. As cluster policy, we do not typically increase a running job's wall time, as doing so is both unfair to other users and could alter the reported start times of existing jobs in the queue. If you are unfamiliar with your code's performance, we strongly recommend padding the wall time at first and then working backwards.

Job/Queue Management

squeue is used to show the queue. By default it shows all currently pending and running jobs:

  • -l                  : long listing
  • -u username         : only show the jobs of the chosen user
  • -A account          : show jobs from a specific group, usually a PI
  • --state=pending     : show pending jobs
  • --state=running     : show running jobs
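
These options can be combined; for example, to get a long listing of user foo's pending jobs:

squeue -l -u foo --state=pending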

scancel is used to cancel (i.e. kill) a job.  Here are some options to use:

  • jobid               : kill the job with that job ID
  • -u username         : kill all jobs for the user
  • --state=running     : kill jobs that are in state "running"
  • --state=pending     : kill jobs that are in state "pending"

You can stack these options to target a particular set of jobs. For example, "scancel -u foo --state=pending" will kill all pending jobs for user "foo".

scontrol show job is used to display job information for pending and running jobs, such as holds, resource requests, resource allocations, etc. This is a great first step in checking on a job.
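
For example, to inspect job 1234:

scontrol show job 1234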

scontrol hold holds a job. Pass it a job ID (e.g. "scontrol hold 1234").

scontrol release releases a held job. Pass it a job ID (e.g. "scontrol release 1234").


Checking Usage

sreport is a good option for showing historical job usage by username or group. 

To obtain the usage of an entire group:

sreport -T gres/gpu,cpu cluster accountutilizationbyuser start=01/01/18T00:00:00 end=now -t hours account=<group-account-name>

To obtain the usage of a single user:

sreport -T gres/gpu,cpu cluster accountutilizationbyuser start=01/01/18T00:00:00 end=now -t hours user=<username>

sacct shows current and historical job information in more detail than sreport. Important options:

  • -S from_date: Show jobs that started on or after from_date. There are several valid formats, but the easiest is probably "MMDD". See "man sacct" for more options.
  • -l ("l" for "long"): gives more verbose information
  • -u someusername: limit output to jobs by someusername
  • -A someprofessor: limit output to jobs by someprofessor's research group
  • -j jobid: specify a particular job to examine
  • -o format options: see "man sacct" for more fields to examine; there are a lot

Example:

sacct -u <username> -S 0101 --format JobId,AllocCPUs,UserCPU

Launching tasks within a job

MPI Jobs

mpirun
Both OpenMPI and Intel MPI have support for the Slurm scheduler, so it should take no special effort to run your job under it. They look for the environment variables Slurm sets when your job is allocated and are then able to use those to start processes on the correct number of nodes and on the specific hosts:

mpirun executable options
srun
srun is the task launcher for Slurm. It is built with PMI support, so it is a great way to start processes on the nodes for your MPI workflow. srun launches processes more efficiently and faster than mpirun, and all processes launched by srun are consolidated into one job step, which makes it easier to see where time was spent in a job. When using mpirun, each process appears as its own job step.

Typically you can just use srun as you would mpirun, since it is aware of MPI and of your job's allocation:

srun executable options

srun will run processes on all of the nodes and task slots allocated to the job, though you can specify otherwise if you prefer.
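
As a sketch, an MPI batch script using srun as the launcher might look like this (my_mpi_app is a placeholder):

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=16
#SBATCH --time=4:00:00

# load your MPI environment here if needed (e.g. via modules)
srun ./my_mpi_app    # starts 32 tasks across the 2 allocated nodes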


Embarrassingly Parallel Jobs

Embarrassingly parallel is a term for jobs that can be run independently of each other but benefit from being run a large number of times; it is not a term of derision. Monte Carlo simulations fall into this category and are a very common use case in high-throughput computing. Depending on your use case, you may use srun, Slurm job arrays, GNU Parallel, or some other framework to launch the jobs.
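
For instance, a sketch of a Slurm job array that runs 100 independent copies of a program (my_program and the input file naming are placeholders):

#!/bin/bash
#SBATCH --array=1-100    # 100 independent array tasks
#SBATCH -n 1
#SBATCH --time=1:00:00

# each array task processes its own input file
./my_program input_${SLURM_ARRAY_TASK_ID}.dat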

Interactive Jobs

Command Line access

To get a shell on a compute node with allocated resources for interactive use, run the following command, specifying the information needed, such as queue, time, nodes, and tasks:

srun --pty -t hh:mm:ss -n tasks -N nodes /bin/bash -l

This is a good way to interactively debug your code or try new things. You can also request specific resources here, such as GPUs or memory.
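
For example, to request an interactive shell for two hours on one node with one GPU and 16GB of memory:

srun --pty -t 2:00:00 -N 1 -n 1 --gres=gpu:1 --mem=16G /bin/bash -l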

X11

You can also run an X11 application on a compute node through an allocation. To do this, make sure you have an X server working on your local system and that you are forwarding X connections through your ssh connection (-X). Then use the --x11 option to set up the forwarding:

srun --x11 -t hh:mm:ss -N 1 xterm

Keep in mind that this is likely to be slow, and the session will end if the ssh connection is terminated. A more robust solution is to use FastX; see the FastX tutorial for details.