-
Welcome to HPC Scheduling — Slurm Basics. This course turns the canonical Sigma2 "hpc-intro" episode on the job scheduler into a guided pathway with short assessments after each topic.
What you'll learn
- Why HPC systems use a scheduler and what role Slurm plays.
- How to submit a batch job, monitor it, and find the output.
- How to write a reusable #SBATCH script with sensible resource requests.
- How to cancel jobs and start an interactive session on a compute node.
How the course is structured
- Each of the six content topics has a short reading followed by a quiz.
- The readings summarise the concept in ~300 words and link to the canonical Sigma2 episode for the full walkthrough.
- Quizzes use multiple-choice, short-answer, and "fill-in-the-script" Cloze questions. All grading is automatic.
- The Final Assessment in the last section is a complete SBATCH scenario — work through it once you've finished topics 1–6.
Before you start
You don't need cluster access to take the quizzes — every example is grader-checked against an expected string, not against a live Slurm queue. If you do have an NRIS account, follow along on Saga / Olivia / Betzy / Fox while you work.
The source material is based on Sigma2's hpc-intro tutorial (episode 13). The canonical lesson is linked at the bottom of every topic.
-
Why HPC systems need a scheduler, and what role Slurm plays.
-
-
The classic first job is a short script that prints the hostname of whatever node it lands on:

    #!/bin/bash
    echo -n "This script is running on "
    hostname

Save it as example-job.sh, mark it executable (chmod +x), then submit with sbatch. On Saga the minimum required flags are --account, --time, and --mem:

    sbatch --account=nn9970k --mem=1G --time=01:00 example-job.sh

Check what's queued or running with squeue -u $USER. To watch it refresh: squeue -u $USER --iterate 5 (Ctrl-C to stop).

Mandatory flags differ per cluster; Betzy, for example, also needs --nodes and a partition. The job script generator is the quickest way to get a working starter for each cluster.
-
-
When a batch job runs, its stdout and stderr are captured into a file in the directory you submitted from, named slurm-<JOBID>.out by default.

    $ ls
    example-job.sh  slurm-14738683.out

Inside that file you'll find: the node the job ran on, your script's own output, accounting (CPU-hours consumed against the project), wallclock vs CPU time, exit status, and a memory summary.
Wallclock time is real elapsed time. CPU time is the time the CPU was actively running your code. A 2-minute wallclock job that mostly waits on I/O might only consume a few seconds of CPU; a 2-minute job using 4 cores at 100% consumes ~8 CPU-minutes.
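The arithmetic behind the four-core example can be sketched in a few lines of shell. The function name cpu_minutes and its utilisation parameter are invented here for illustration; real figures come from Slurm's own accounting in the output file.

```shell
#!/bin/bash
# Hypothetical helper, not part of Slurm: estimate CPU time from
# wallclock minutes, core count, and average utilisation in percent.
cpu_minutes() {
    local wall_min=$1 cores=$2 util_pct=$3
    # Integer arithmetic; fractional minutes are truncated.
    echo $(( wall_min * cores * util_pct / 100 ))
}

# Four fully busy cores for 2 wallclock minutes.
cpu_minutes 2 4 100   # prints 8
```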
If your script fails partway through, the error message lands in the same file — useful for post-mortem when you can't watch the job live.
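To locate that file from a script, one option is to derive its name from what sbatch prints on submission. This sketch assumes sbatch's usual "Submitted batch job <JOBID>" message and the default output name. The function name is invented here, and the sample message is hard-coded rather than taken from a real submission.

```shell
#!/bin/bash
# Hypothetical helper: map sbatch's confirmation message to the
# default output file name, slurm-<JOBID>.out.
default_output_file() {
    local message=$1
    # The job ID is the last space-separated word of the message.
    local job_id=${message##* }
    echo "slurm-${job_id}.out"
}

# Sample message, hard-coded for illustration.
default_output_file "Submitted batch job 14738683"
```

On a cluster the argument would be the captured output of the sbatch call. The result only applies when the script does not set its own --output name.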
-
-
Typing every flag on the command line gets old. Slurm reads #SBATCH comment lines at the top of a script as if they were command-line flags, so the script becomes self-contained and reproducible.

    #!/bin/bash
    #SBATCH --account=nn9970k
    #SBATCH --time=01:00
    #SBATCH --mem=1G
    #SBATCH --job-name=hello
    #SBATCH --output=output-%j.txt
    #SBATCH --error=error-%j.txt

    ./example-job.sh

Then just run sbatch myscript.sh and the flags inside the script take effect. Anything passed on the command line still wins over an in-script directive, so you can override for one-off runs.

Useful directive shortcuts:
- %j in --output/--error filenames is replaced with the job ID.
- --job-name= sets the name shown in squeue.
- --mail-type=END,FAIL + --mail-user=… emails you when the job finishes or fails.
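The effect of %j can be previewed without a cluster. The sketch below only imitates the substitution with sed; Slurm performs it internally, and the function name expand_job_pattern is made up for this example.

```shell
#!/bin/bash
# Hypothetical helper: mimic how %j in a filename pattern is replaced
# by the job ID. Slurm does this itself; this only shows the result.
expand_job_pattern() {
    local pattern=$1 job_id=$2
    # printf '%s' keeps the literal % in the pattern from being
    # interpreted as a format specifier.
    printf '%s\n' "$pattern" | sed "s/%j/${job_id}/g"
}

expand_job_pattern "output-%j.txt" 14738683
expand_job_pattern "error-%j.txt" 14738683
```

Because each job ID is different, two runs of the same script never overwrite each other's output files.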
The quiz for this topic asks you to fill in an SBATCH script; pay attention to the unit format (Slurm accepts both 1G and 1024M for memory, and both 01:00 and 1:00 for time).
-
Fill in the SBATCH directives. This is the load-bearing exercise for this topic.
-
Four flags matter most of the time:
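- --time=<d-hh:mm:ss>: wallclock limit. The days part is optional; 30:00 means 30 minutes, 1:00:00 means 1 hour.
- --mem=<value>: memory per node (1G, 4096M, etc.).
- --ntasks=<N>: number of tasks. Each task gets one CPU core by default, so this is usually the total core count.
- --nodes=<N>: how many distinct machines.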
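To check your reading of these formats before a quiz, the conversions can be sketched in shell. Both function names are invented for this example, and the sketch covers only the mm:ss, hh:mm:ss, and d-hh:mm:ss time forms and the M and G memory suffixes; Slurm itself accepts more variants.

```shell
#!/bin/bash
# Hypothetical helper: convert a subset of Slurm time strings
# (mm:ss, hh:mm:ss, d-hh:mm:ss) to seconds.
slurm_time_to_seconds() {
    local spec=$1 has_days=0 days=0 h=0 m=0 s=0
    # Split off the optional "d-" days prefix.
    if [[ $spec == *-* ]]; then
        has_days=1
        days=${spec%%-*}
        spec=${spec#*-}
    fi
    # Split the remainder on colons.
    local IFS=:
    local parts=($spec)
    case "${has_days}:${#parts[@]}" in
        0:2)     m=${parts[0]}; s=${parts[1]} ;;
        0:3|1:3) h=${parts[0]}; m=${parts[1]}; s=${parts[2]} ;;
        *) echo "unsupported time format: $1" >&2; return 1 ;;
    esac
    # 10# forces base 10 so that "08" is not read as octal.
    echo $(( 10#$days * 86400 + 10#$h * 3600 + 10#$m * 60 + 10#$s ))
}

# Hypothetical helper: convert a memory request with an M or G
# suffix to megabytes, treating 1G as 1024M.
mem_to_megabytes() {
    local spec=$1
    case $spec in
        *G) echo $(( ${spec%G} * 1024 )) ;;
        *M) echo "${spec%M}" ;;
        *)  echo "unsupported memory format: $spec" >&2; return 1 ;;
    esac
}

slurm_time_to_seconds 30:00     # 30 minutes
slurm_time_to_seconds 1:00:00   # 1 hour
mem_to_megabytes 1G
```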
Two non-obvious truths:
- Requesting more does not make your job faster. Asking for 32 cores when your script uses 1 just wastes the other 31 and makes you wait longer in the queue.
- Exceeding what you asked for kills the job. If your job runs longer than --time, Slurm cancels it with "TIME LIMIT"; output up to that point is still in the .out file.
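A quick post-mortem check for the second case is to search the output file for the time-limit text. This is a minimal sketch: the function name is invented, the search string is the one quoted above, and the demo writes a stand-in file rather than using real Slurm output.

```shell
#!/bin/bash
# Hypothetical helper: succeed if a job's output file mentions the
# time limit. The search string is the text quoted in the lesson.
hit_time_limit() {
    grep -q "TIME LIMIT" "$1"
}

# Self-contained demo with a stand-in output file. The line written
# here is illustrative, not a verbatim Slurm message.
demo_file=$(mktemp)
printf 'partial results\njob cancelled: TIME LIMIT\n' > "$demo_file"
if hit_time_limit "$demo_file"; then
    echo "time limit reached; raise --time or shorten the run"
fi
rm -f "$demo_file"
```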
If you skip the flags entirely, you get the cluster's defaults, which are usually a tiny allocation that's only useful for "hello world" scripts.
-
To cancel a queued or running job, use scancel <JOBID>. The job ID is what sbatch printed when you submitted; you can also find it via squeue -u $USER.

    $ scancel 38759
    $ squeue -u $USER
    # (job is gone within a few seconds)

Sometimes you don't want a batch job; you want a shell on a compute node so you can poke at things interactively. Use salloc:

    salloc --account=nn9970k --time=30:00 --ntasks=1 --mem=4G

The session lives as long as you stay connected. Drop your laptop's network for too long and the job dies. To survive disconnects, wrap your work in tmux on the login node before calling salloc; you can re-attach with tmux attach after reconnecting.

One catch on NRIS clusters: each tmux session is bound to the specific login node where you started it. If you ssh in and get a different login node, ssh again specifying the one your session is on (ssh login-1).
-
-
One quiz that pulls everything together — including a multi-line SBATCH script you build from scratch.
-
Everything together — including a complete SBATCH script you write from scratch.
-
