One of the first stumbling blocks in learning to use an HPC (high-performance computing) cluster is submitting a job to the queue. The job runs on the cluster's compute nodes, not on the machine where you submitted it. An sbatch script looks a lot like a shell script with a whole lot of comments in it, but the comment lines that start with #SBATCH are actually Slurm directives: they tell the scheduler how to set up the environment in which to run your job when it comes up in the queue.
Let’s look at a simple sbatch script:
```shell
#!/bin/bash
## Comments are double-hashed
#SBATCH --job-name=primes
#SBATCH --partition=picluster
#SBATCH --output=/work/primes-%j.out
#SBATCH --error=/work/error-%j.out
#SBATCH --nodes=4
srun python3 /work/primes.py
```
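To submit, you don't execute the script yourself; you hand it to the scheduler with sbatch, which queues it and prints the assigned job ID. (The filename primes.sbatch is my assumption, and these commands only work on a machine with Slurm installed.)

```shell
# Queue the script; sbatch replies with "Submitted batch job <jobid>"
sbatch primes.sbatch

# Watch your jobs waiting or running in the queue
squeue -u $USER
```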
Start with a shebang line. This IS a script, remember? Next is the first sbatch directive: it just gives your job a name so you can find it in the queue. The second directive tells the scheduler which partition to run your job in. You probably only have one, but most clusters have multiple partitions, some with GPUs or other more valuable resources that the average user might not have access to.
--output and --error tell Slurm where to redirect the job's standard output and standard error. Since this is a batch job, you'll want your output saved to a file. The same goes for any errors that might arise: good to capture for troubleshooting. The %j in each path is replaced with the job ID, so every run gets its own files.
--nodes=4 tells the scheduler to spread your job across four nodes. Otherwise, your job would just run on the first available cores. In my cluster, that would mean the job would run on only one of the Raspberry Pis, consuming its four-core CPU. I'd rather spread the work out over multiple machines, so the first Pi doesn't get stuck with all the work and overheat!
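For finer control, --nodes can be paired with --ntasks-per-node, which sets how many copies of the command srun launches on each node. A sketch with illustrative values (this pairing is my addition, not part of the original script):

```shell
#SBATCH --nodes=4            # reserve four separate machines
#SBATCH --ntasks-per-node=1  # launch one task on each of them
# srun starts (nodes x ntasks-per-node) copies of the command:
srun python3 /work/primes.py
```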
Lastly, the srun command! This is the line that runs the actual script. Mine is a Python script in my /work NFS mount, which is shared between all the nodes. You’ll notice, too, that I pointed the output files to that same directory.
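The contents of primes.py aren't shown here, but a minimal stand-in that would work in this setup might look something like this (the limit of 100 is arbitrary):

```python
# primes.py -- a minimal stand-in for the script that srun executes.
# Prints all primes below LIMIT using simple trial division.
LIMIT = 100

def is_prime(n):
    """Return True if n is prime, checking divisors up to sqrt(n)."""
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True

if __name__ == "__main__":
    primes = [n for n in range(2, LIMIT) if is_prime(n)]
    print(primes)
```

When the job runs, whatever this prints lands in the /work/primes-%j.out file named by the --output directive.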
What are some sbatch directives that YOU like to include in YOUR submit scripts?
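For what it’s worth, a few I find myself reaching for (the values below are placeholders, not recommendations):

```shell
#SBATCH --time=01:00:00       # wall-clock limit; the job is killed after one hour
#SBATCH --mem=1G              # memory to reserve per node
#SBATCH --cpus-per-task=4     # cores allotted to each task
#SBATCH --mail-type=END,FAIL  # email when the job finishes or fails
#SBATCH --mail-user=you@example.com
```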