Course: HPC Scheduling — Slurm Basics

Welcome

Welcome to HPC Scheduling — Slurm Basics. This course turns the canonical Sigma2 "hpc-intro" episode on the job scheduler into a guided pathway with short assessments after each topic.

What you'll learn

Why HPC systems use a scheduler and what role Slurm plays.

How to submit a batch job, monitor it, and find the output.

How to write a reusable #SBATCH script with sensible resource requests.

How to cancel jobs and start an interactive session on a compute node.

How the course is structured

Each of the six content topics has a short reading followed by a quiz.

The readings summarise the concept in ~300 words and link to the canonical Sigma2 episode for the full walkthrough.

Quizzes use multiple-choice, short-answer, and "fill-in-the-script" Cloze questions. All grading is automatic.

The Final Assessment in the last section is a complete SBATCH scenario — work through it once you've finished topics 1–6.

Before you start

You don't need cluster access to take the quizzes — every example is grader-checked against an expected string, not against a live Slurm queue. If you do have an NRIS account, follow along on Saga / Olivia / Betzy / Fox while you work.

Source material is based on Sigma2's hpc-intro tutorial (episode 13). The canonical lesson is linked at the bottom of every topic.
- Announcements (Forum)

What is a scheduler?
Why HPC systems need a scheduler, and what role Slurm plays.
- Quiz: What is a scheduler?
  
  Two quick questions to lock in the concept.
- 1. What is a scheduler? (Page)

Submit your first job
The classic first job is a short script that prints the hostname of whichever node it lands on:

#!/bin/bash
echo -n "This script is running on "
hostname

Save it as example-job.sh, mark it executable (chmod +x), then submit with sbatch. On Saga the minimum required flags are --account, --time, --mem:

sbatch --account=nn9970k --mem=1G --time=01:00 example-job.sh

Check what's queued or running with squeue -u $USER. To watch it refresh: squeue -u $USER --iterate 5 (Ctrl-C to stop).
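To follow one particular job rather than everything you have queued, you need its job ID. The sketch below assumes sbatch prints its usual confirmation line of the form "Submitted batch job <JOBID>"; the helper name job_id_from_sbatch_output is made up for this illustration and is not part of Slurm.

```shell
#!/bin/bash
# Illustrative helper (not part of Slurm): pull the numeric job ID out of
# sbatch's confirmation line, assumed to look like
# "Submitted batch job 14738683".
job_id_from_sbatch_output() {
  # The job ID is the fourth whitespace-separated word on the line.
  printf '%s\n' "$1" | awk '{ print $4 }'
}

# Typical use on a cluster (commented out so the sketch runs anywhere):
#   msg=$(sbatch --account=nn9970k --mem=1G --time=01:00 example-job.sh)
#   squeue -j "$(job_id_from_sbatch_output "$msg")"
```

If you would rather skip the parsing step, sbatch also has a --parsable option that prints just the job ID (followed by the cluster name on multi-cluster setups).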

Mandatory flags differ per cluster — Betzy also needs --nodes and a partition. The job script generator is the quickest way to get a working starter for each cluster.

Read the canonical section →
- Quiz: Your first job
  
  Submitting, queueing, finding the output.

Inspect the output
When a batch job runs, its stdout and stderr are captured into a file in the directory you submitted from, named slurm-<JOBID>.out by default.

$ ls
example-job.sh  slurm-14738683.out

Inside that file you'll find: the node the job ran on, your script's own output, accounting (CPU-hours consumed against the project), wallclock vs CPU time, exit status, and a memory summary.

Wallclock time is real elapsed time. CPU time is the time the CPU was actively running your code. A 2-minute wallclock job that mostly waits on I/O might only consume a few seconds of CPU; a 2-minute job using 4 cores at 100% consumes ~8 CPU-minutes.
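The arithmetic behind those two examples can be sketched in a few lines of shell. This is a rough model only: cpu_seconds is a hypothetical helper, and the 5% utilisation figure for the I/O-bound job is an assumed value chosen for illustration.

```shell
#!/bin/bash
# cpu_seconds CORES WALL_SECONDS UTIL_PERCENT
# Rough model: CPU time = cores x wallclock x average utilisation.
# Uses integer arithmetic, so results are rounded down to whole seconds.
cpu_seconds() {
  echo $(( $1 * $2 * $3 / 100 ))
}

cpu_seconds 4 120 100   # 4 fully busy cores for 2 minutes: 480 s = 8 CPU-minutes
cpu_seconds 1 120 5     # 1 core mostly waiting on I/O (assumed 5% busy): 6 s
```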

If your script fails partway through, the error message lands in the same file — useful for post-mortem when you can't watch the job live.
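For that post-mortem step, a small helper can save some typing. This is a minimal sketch that assumes the default slurm-<JOBID>.out name described above; show_job_output is a hypothetical name, and a job that sets its own --output path would need that path instead.

```shell
#!/bin/bash
# show_job_output JOBID [DIR]
# Print the last 20 lines of slurm-<JOBID>.out from DIR (default: the
# current directory). Assumes the default output file name.
show_job_output() {
  local file="${2:-.}/slurm-$1.out"
  if [ -f "$file" ]; then
    tail -n 20 "$file"
  else
    echo "no output file yet: $file" >&2
    return 1
  fi
}

# Example, run from the directory you submitted from:
#   show_job_output 14738683
```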

Read the canonical section →
- Quiz: Inspect the output
  
  What ends up in slurm-*.out, and what wallclock time means.

Write a batch script (#SBATCH)
Typing every flag on the command line gets old. Slurm reads #SBATCH comment lines at the top of a script as if they were command-line flags — the script becomes self-contained and reproducible.

#!/bin/bash
#SBATCH --account=nn9970k
#SBATCH --time=01:00
#SBATCH --mem=1G
#SBATCH --job-name=hello
#SBATCH --output=output-%j.txt
#SBATCH --error=error-%j.txt

./example-job.sh

Then just sbatch myscript.sh — the flags inside the script take effect. Anything passed on the command line still wins over an in-script directive, so you can override for one-off runs.
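The "top of a script" part matters: sbatch stops reading directives at the first line that is neither a comment nor blank, so a #SBATCH line placed after a command is ignored. The sketch below lists the directives sbatch would honour in a given file; list_sbatch_directives is a hypothetical helper written for this illustration, not a Slurm tool.

```shell
#!/bin/bash
# list_sbatch_directives SCRIPT
# Print the #SBATCH lines that appear before the first executable command.
# Any #SBATCH line after that point is not read by sbatch.
list_sbatch_directives() {
  awk '
    /^#SBATCH/          { print; next }   # a directive sbatch would read
    /^#/ || /^[ \t]*$/  { next }          # other comments and blank lines
                        { exit }          # first real command: stop here
  ' "$1"
}

# Example:
#   list_sbatch_directives myscript.sh
```

Running it before you submit is a quick way to check what a script actually requests.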

Useful directive shortcuts:

%j in --output / --error filenames is replaced with the job ID.

--job-name= sets the name shown in squeue.

--mail-type=END,FAIL + --mail-user=… emails you when the job finishes or fails.

The quiz for this topic asks you to fill in an SBATCH script, so pay attention to the unit format: Slurm accepts both 1G and 1024M for memory, and both 01:00 and 1:00 for a one-minute time limit (minutes:seconds).
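To see why 1G and 1024M are interchangeable, here is a minimal sketch that normalises a --mem value to megabytes. The helper mem_to_mb is hypothetical; it handles only the M, G and T suffixes and treats a bare number as megabytes, which is Slurm's default unit.

```shell
#!/bin/bash
# mem_to_mb VALUE
# Convert a --mem value such as 1G or 4096M to megabytes.
# Sketch only: supports M, G and T suffixes; a bare number means megabytes.
mem_to_mb() {
  local number=${1%[MGTmgt]}   # strip one trailing unit letter, if any
  local unit=${1#"$number"}    # whatever was stripped (may be empty)
  case $number in
    ''|*[!0-9]*) echo "unsupported value: $1" >&2; return 1 ;;
  esac
  case $unit in
    ''|M|m) echo "$number" ;;
    G|g)    echo $(( number * 1024 )) ;;
    T|t)    echo $(( number * 1024 * 1024 )) ;;
  esac
}

mem_to_mb 1G      # prints 1024
mem_to_mb 1024M   # prints 1024
```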

Read the canonical section →
- Quiz: Write a batch script
  
  Fill in the SBATCH directives. This is the load-bearing exercise for this topic.

Resource requests
Four flags matter most of the time:

--time=<d-hh:mm:ss>: wallclock limit. The days part is optional; 30:00 means 30 minutes, 1:00:00 means 1 hour.

--mem=<value>: memory per node (1G, 4096M, etc.).

--ntasks=<N>: number of tasks (processes). With the default of one CPU per task, this is also the total number of CPU cores.

--nodes=<N>: how many distinct machines.
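The --time formats are easy to misread, so here is a small sketch that converts a value to seconds. It assumes the usual Slurm forms (minutes, minutes:seconds, hours:minutes:seconds, and the same with a leading days- part); slurm_time_to_seconds is a hypothetical helper, not a Slurm command.

```shell
#!/bin/bash
# slurm_time_to_seconds SPEC
# Convert a --time value to seconds. Without a days part the fields are
# read as minutes, minutes:seconds or hours:minutes:seconds; with a days
# part the remainder is hours[:minutes[:seconds]].
slurm_time_to_seconds() {
  printf '%s\n' "$1" | awk '{
    days = 0; h = 0; m = 0; s = 0; rest = $0
    if (index(rest, "-") > 0) {
      split(rest, dp, "-"); days = dp[1]; rest = dp[2]
      n = split(rest, f, ":")
      h = f[1]
      if (n >= 2) m = f[2]
      if (n >= 3) s = f[3]
    } else {
      n = split(rest, f, ":")
      if (n == 1)      { m = f[1] }
      else if (n == 2) { m = f[1]; s = f[2] }
      else             { h = f[1]; m = f[2]; s = f[3] }
    }
    print days * 86400 + h * 3600 + m * 60 + s
  }'
}

slurm_time_to_seconds 30:00     # prints 1800 (30 minutes)
slurm_time_to_seconds 1:00:00   # prints 3600 (1 hour)
```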

Two non-obvious truths:

Requesting more does not make your job faster. Asking for 32 cores when your script uses 1 just wastes the other 31 and makes you wait longer in the queue.

Exceeding what you asked for kills the job. If your job runs longer than --time, Slurm cancels it with "TIME LIMIT" — output up to that point is still in the .out file.

If you skip the flags entirely, you get the cluster's defaults, which are usually a tiny allocation that's only useful for "hello world" scripts.

Read the canonical section →
- Quiz: Resource requests
  
  Time formats, memory units, and what happens when you exceed them.

Cancel + interactive jobs
To cancel a queued or running job, use scancel <JOBID>. The job ID is what sbatch printed when you submitted; you can also find it via squeue -u $USER.

$ scancel 38759
$ squeue -u $USER   # (job is gone within a few seconds)

Sometimes you don't want a batch job — you want a shell on a compute node so you can poke at things interactively. Use salloc:

salloc --account=nn9970k --time=30:00 --ntasks=1 --mem=4G

The session lives as long as you stay connected. Drop your laptop's network for too long and the job dies. To survive disconnects, wrap your work in tmux on the login node before calling salloc — you can re-attach with tmux attach after reconnecting.

One catch on NRIS clusters: each tmux session is bound to the specific login node where you started it. If you ssh in and get a different login node, ssh again specifying the one your session is on (ssh login-1).
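One way to avoid guessing which login node holds your session is to write the node name down when you start. The sketch below is purely illustrative: the helper names and the NODE_FILE location are invented, and it only helps if your home directory is shared between login nodes, which is an assumption about the cluster setup.

```shell
#!/bin/bash
# Hypothetical helpers: remember which login node hosts the tmux session.
# NODE_FILE is an invented location; it must live on storage that every
# login node can see (assumed here to be the home directory).
NODE_FILE=${NODE_FILE:-$HOME/.tmux-login-node}

# Run once on the login node, just before starting tmux and salloc.
remember_login_node() {
  hostname > "$NODE_FILE"
}

# Run after reconnecting: prints the ssh command that reaches the session.
reattach_command() {
  printf 'ssh -t %s tmux attach\n' "$(cat "$NODE_FILE")"
}
```

After a dropped connection, reattach_command prints a line such as "ssh -t login-1 tmux attach" that you can copy and run.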

Read the canonical section →
- Quiz: Cancel + interactive
  
  scancel, salloc, and the tmux trick.

Final assessment
One quiz that pulls everything together — including a multi-line SBATCH script you build from scratch.
- Final assessment (Quiz)
  
  Everything together — including a complete SBATCH script you write from scratch.