Section outline

  • Welcome to HPC Scheduling — Slurm Basics. This course turns the canonical Sigma2 "hpc-intro" episode on the job scheduler into a guided pathway with short assessments after each topic.

    What you'll learn

    • Why HPC systems use a scheduler and what role Slurm plays.
    • How to submit a batch job, monitor it, and find the output.
    • How to write a reusable #SBATCH script with sensible resource requests.
    • How to cancel jobs and start an interactive session on a compute node.

    How the course is structured

    1. Each of the six content topics has a short reading followed by a quiz.
    2. The readings summarise the concept in ~300 words and link to the canonical Sigma2 episode for the full walkthrough.
    3. Quizzes use multiple-choice, short-answer, and "fill-in-the-script" Cloze questions. All grading is automatic.
    4. The Final Assessment in the last section is a complete SBATCH scenario — work through it once you've finished topics 1–6.

    Before you start

    You don't need cluster access to take the quizzes: every answer is checked automatically against an expected string, not against a live Slurm queue. If you do have an NRIS account, follow along on Saga / Olivia / Betzy / Fox while you work.

    The source material is Sigma2's hpc-intro tutorial (episode 13). The canonical lesson is linked at the bottom of every topic.

  • Why HPC systems need a scheduler, and what role Slurm plays.

  • The classic first job is a one-line script that prints the hostname of whatever node it lands on:

    #!/bin/bash
    echo -n "This script is running on "
    hostname

    Save it as example-job.sh, mark it executable (chmod +x), then submit with sbatch. On Saga the minimum required flags are --account, --time, --mem:

    sbatch --account=nn9970k --mem=1G --time=01:00 example-job.sh

    Check what's queued or running with squeue -u $USER. To watch it refresh: squeue -u $USER --iterate 5 (Ctrl-C to stop).
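    sbatch normally confirms a submission with a line of the form "Submitted batch job <JOBID>". The following is a minimal sketch of pulling the ID out of that line so later commands can reuse it. The helper name job_id_from_sbatch is hypothetical, and the sample line is a stand-in for real sbatch output, so the block runs without a cluster.

```shell
# Hypothetical helper: extract the numeric job ID from the confirmation
# line that sbatch prints, e.g. "Submitted batch job 14738683".
job_id_from_sbatch() {
    # The ID is the last whitespace-separated field on the line.
    printf '%s\n' "$1" | awk '{ print $NF }'
}

# Stand-in for the text sbatch would print on a real cluster.
sample="Submitted batch job 14738683"
jobid=$(job_id_from_sbatch "$sample")
echo "Job ID: $jobid"

# On a cluster you could then inspect just that job:
#   squeue -j "$jobid"
```

    sbatch also has a --parsable option that prints only the job ID (followed by the cluster name when one is present), which avoids the parsing step.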

    Mandatory flags differ per cluster — Betzy also needs --nodes and a partition. The job script generator is the quickest way to get a working starter for each cluster.

    Read the canonical section →

    • Submitting, queueing, finding the output.

  • When a batch job runs, its stdout and stderr are captured into a file in the directory you submitted from, named slurm-<JOBID>.out by default.

    $ ls
    example-job.sh  slurm-14738683.out

    Inside that file you'll find: the node the job ran on, your script's own output, accounting (CPU-hours consumed against the project), wallclock vs CPU time, exit status, and a memory summary.

    Wallclock time is real elapsed time. CPU time is the time the CPU was actively running your code. A 2-minute wallclock job that mostly waits on I/O might only consume a few seconds of CPU; a 2-minute job using 4 cores at 100% consumes ~8 CPU-minutes.
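    The arithmetic behind those two examples can be sketched as wallclock time multiplied by the number of cores and their average utilisation. The function name cpu_seconds and the 5% utilisation figure for the I/O-bound case are assumptions chosen for illustration; real accounting is done by Slurm, not by a formula like this.

```shell
# cpu_seconds WALL_SECONDS CORES [UTILISATION_PERCENT]
# Rough estimate only: wallclock x cores x average utilisation.
cpu_seconds() {
    wall_seconds=$1
    cores=$2
    util_percent=${3:-100}   # assume fully busy cores unless told otherwise
    echo $(( wall_seconds * cores * util_percent / 100 ))
}

cpu_seconds 120 4      # 2 minutes on 4 busy cores: prints 480 (8 CPU-minutes)
cpu_seconds 120 1 5    # 2 minutes, 1 core, mostly waiting on I/O: prints 6
```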

    If your script fails partway through, the error message lands in the same file — useful for post-mortem when you can't watch the job live.

    Read the canonical section →

    • What ends up in slurm-*.out, and what wallclock time means.

  • Typing every flag on the command line gets old. Slurm reads #SBATCH comment lines at the top of a script as if they were command-line flags — the script becomes self-contained and reproducible.

    #!/bin/bash
    #SBATCH --account=nn9970k
    #SBATCH --time=01:00
    #SBATCH --mem=1G
    #SBATCH --job-name=hello
    #SBATCH --output=output-%j.txt
    #SBATCH --error=error-%j.txt
    
    ./example-job.sh

    Save the wrapper as myscript.sh and submit it with sbatch myscript.sh; the flags inside the script take effect. Anything passed on the command line still wins over an in-script directive, so you can override for one-off runs.
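    Because the directives are ordinary comment lines, you can check what a script will request before submitting it. This sketch recreates the script above in a scratch directory and prints only its directives; the helper name list_directives is hypothetical.

```shell
# Recreate the wrapper script from the text in a scratch directory.
scratch=$(mktemp -d)
cat > "$scratch/myscript.sh" <<'EOF'
#!/bin/bash
#SBATCH --account=nn9970k
#SBATCH --time=01:00
#SBATCH --mem=1G
#SBATCH --job-name=hello
#SBATCH --output=output-%j.txt
#SBATCH --error=error-%j.txt

./example-job.sh
EOF

# Hypothetical helper: print the directive lines with the "#SBATCH"
# prefix removed, so they read like command-line flags.
list_directives() {
    sed -n 's/^#SBATCH[[:space:]]*//p' "$1"
}

list_directives "$scratch/myscript.sh"

# A one-off override on a real cluster would look like:
#   sbatch --time=05:00 myscript.sh
```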

    Useful directive shortcuts:

    • %j in --output / --error filenames is replaced with the job ID.
    • --job-name= sets the name shown in squeue.
    • --mail-type=END,FAIL + --mail-user=… emails you when the job finishes or fails.
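    To see which filename a %j pattern will produce, the substitution can be imitated locally. This is an illustrative sketch only: expand_job_pattern is a hypothetical helper that handles %j and nothing else, and the job ID is just an example value.

```shell
# Hypothetical helper: show the filename that results when %j in a
# pattern is replaced by the job ID. Only %j is handled here.
expand_job_pattern() {
    pattern=$1
    id=$2
    printf '%s\n' "$pattern" | sed "s/%j/$id/g"
}

expand_job_pattern "output-%j.txt" 14738683   # prints output-14738683.txt
expand_job_pattern "error-%j.txt" 14738683    # prints error-14738683.txt
```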

    The quiz for this topic asks you to fill in an SBATCH script. Pay attention to the unit format: Slurm accepts both 1G and 1024M for memory, and both 01:00 and 1:00 for time.
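    The equivalence of 1G and 1024M can be checked with a small conversion. This is a sketch under stated assumptions: mem_to_mb is a hypothetical helper, it expects an integer with an optional K, M, G or T suffix, and it treats a bare number as megabytes.

```shell
# Hypothetical helper: convert a --mem value to megabytes.
# Assumes an integer followed by an optional K, M, G or T suffix;
# a bare number is treated as megabytes.
mem_to_mb() {
    value=${1%[KMGT]}        # the number without its suffix
    suffix=${1#"$value"}     # whatever is left: the suffix, or empty
    case $suffix in
        K)    echo $(( value / 1024 )) ;;   # integer division rounds down
        ''|M) echo "$value" ;;
        G)    echo $(( value * 1024 )) ;;
        T)    echo $(( value * 1024 * 1024 )) ;;
    esac
}

mem_to_mb 1G       # prints 1024
mem_to_mb 1024M    # prints 1024
mem_to_mb 4096M    # prints 4096
```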

    Read the canonical section →

    • Fill in the SBATCH directives. This is the key exercise for this topic.

  • Four flags matter most of the time:

    • --time=<d-hh:mm:ss> — wallclock. Days part is optional; 30:00 means 30 minutes, 1:00:00 means 1 hour.
    • --mem=<value> — memory per node (1G, 4096M, etc.).
    • --ntasks=<N> — number of tasks (processes); each task gets one CPU core unless you also set --cpus-per-task.
    • --nodes=<N> — how many distinct machines.
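    The --time formats are easy to misread, so here is a minimal sketch that converts a time string to seconds. It assumes the formats Slurm documents (minutes, minutes:seconds, hours:minutes:seconds, and the same with a leading days- part that makes the first field hours); the helper name slurm_time_to_seconds is hypothetical and the function does no input validation.

```shell
# Hypothetical helper: convert a Slurm --time string to seconds.
slurm_time_to_seconds() {
    printf '%s\n' "$1" | awk '
    {
        days = 0; h = 0; m = 0; s = 0
        rest = $0
        # An optional "days-" prefix comes first.
        has_days = (index(rest, "-") > 0)
        if (has_days) {
            split(rest, d, "-")
            days = d[1]
            rest = d[2]
        }
        n = split(rest, t, ":")
        if (has_days) {
            # After a days part the fields read hours[:minutes[:seconds]].
            h = t[1]
            if (n >= 2) m = t[2]
            if (n >= 3) s = t[3]
        } else if (n == 1) {
            m = t[1]                          # minutes
        } else if (n == 2) {
            m = t[1]; s = t[2]                # minutes:seconds
        } else {
            h = t[1]; m = t[2]; s = t[3]      # hours:minutes:seconds
        }
        print days * 86400 + h * 3600 + m * 60 + s
    }'
}

slurm_time_to_seconds 30:00        # prints 1800  (30 minutes)
slurm_time_to_seconds 1:00:00      # prints 3600  (1 hour)
slurm_time_to_seconds 1-00:00:00   # prints 86400 (1 day)
```

    Note that 01:00 comes out as 60 seconds: with only two fields, the first is minutes, not hours.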

    Two non-obvious truths:

    1. Requesting more does not make your job faster. Asking for 32 cores when your script uses 1 just wastes the other 31 and makes you wait longer in the queue.
    2. Exceeding what you asked for kills the job. If your job runs longer than --time, Slurm cancels it with "TIME LIMIT" — output up to that point is still in the .out file.

    If you skip the flags entirely, you get the cluster's defaults, which are usually a tiny allocation that's only useful for "hello world" scripts.

    Read the canonical section →

    • Time formats, memory units, and what happens when you exceed them.

  • To cancel a queued or running job, use scancel <JOBID>. The job ID is what sbatch printed when you submitted; you can also find it via squeue -u $USER.

    $ scancel 38759
    $ squeue -u $USER
    # (job is gone within a few seconds)
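    Cancelling the wrong job is easy to do with a mistyped ID, so a thin wrapper can help. This is only a sketch: cancel_job and the DRY_RUN variable are hypothetical names, and the check accepts plain numeric IDs only.

```shell
# Hypothetical wrapper: refuse anything that is not a plain numeric
# job ID, and only print the command when DRY_RUN=1 is set.
cancel_job() {
    case $1 in
        ''|*[!0-9]*)
            echo "usage: cancel_job <numeric JOBID>" >&2
            return 1
            ;;
    esac
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "scancel $1"
    else
        scancel "$1"
    fi
}

DRY_RUN=1 cancel_job 38759    # prints: scancel 38759
```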

    Sometimes you don't want a batch job — you want a shell on a compute node so you can poke at things interactively. Use salloc:

    salloc --account=nn9970k --time=30:00 --ntasks=1 --mem=4G

    The session lives as long as you stay connected. Drop your laptop's network for too long and the job dies. To survive disconnects, wrap your work in tmux on the login node before calling salloc — you can re-attach with tmux attach after reconnecting.

    One catch on NRIS clusters: each tmux session is bound to the specific login node where you started it. If you ssh in and get a different login node, ssh again specifying the one your session is on (ssh login-1).
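    One way to avoid hunting for the right login node is to record its name before starting tmux. The sketch below is an assumption-laden illustration: the helper names and the NODE_FILE variable are hypothetical, the demo writes to a temporary file, and on a cluster you would point NODE_FILE at a path in your home directory (assuming it is shared between login nodes).

```shell
# Where to remember the login node. A temporary file is used for this
# demo; on a cluster, set NODE_FILE to a path in your home directory.
node_file=${NODE_FILE:-$(mktemp)}

# Run this on the login node just before starting tmux.
remember_login_node() {
    uname -n > "$node_file"
}

# Run this after reconnecting to see which node to ssh to.
reconnect_hint() {
    echo "ssh $(cat "$node_file")"
}

remember_login_node
reconnect_hint

# The surrounding workflow on a real cluster (session name "hpc" is arbitrary):
#   tmux new -s hpc
#   salloc --account=nn9970k --time=30:00 --ntasks=1 --mem=4G
#   ... connection drops; log in to the same login node again ...
#   tmux attach -t hpc
```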

    Read the canonical section →

    • scancel, salloc, and the tmux trick.

  • One quiz that pulls everything together — including a multi-line SBATCH script you build from scratch.

    • Everything together — including a complete SBATCH script you write from scratch.