Running a Raspberry Pi HPC Cluster

Some Slurm Commands

scontrol show partition

Use this command to check the configuration of your partitions. It shows things like how many CPUs and nodes each partition has, so you can confirm that the configuration you put in place is working the way you planned.
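If you only care about the totals, here is a small sketch that pulls a few fields out of the partition listing. The function name partition_totals is made up for this example, and it assumes the output contains PartitionName=, TotalCPUs= and TotalNodes= fields and that your grep supports -o.

```shell
#!/bin/sh
# Hypothetical helper: keep only the partition name and its CPU/node totals.
# Reads "scontrol show partition" style text on stdin.
partition_totals() {
    grep -o -E '(PartitionName|TotalCPUs|TotalNodes)=[^ ]+'
}

# Only query Slurm when scontrol is actually installed on this machine.
if command -v scontrol >/dev/null 2>&1; then
    scontrol show partition | partition_totals
fi
```

Each matching field is printed on its own line, which makes it easy to eyeball or feed into another script.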

scontrol show config

If you need a little more detail, this command prints the configuration settings as they are currently in effect on the cluster. It's a nice dump of data to pipe through grep.
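As a rough example of piping that dump through grep, the sketch below filters for settings whose name or value matches a pattern. config_grep is a hypothetical wrapper name, and the 'logfile' pattern is just one guess at something worth looking up.

```shell
#!/bin/sh
# Hypothetical helper: case-insensitive filter over "Key = Value" config lines.
# Reads the config dump on stdin; $1 is an extended regular expression.
config_grep() {
    grep -i -E "$1"
}

# Only query Slurm when scontrol is actually installed on this machine.
if command -v scontrol >/dev/null 2>&1; then
    # Show where the daemons are configured to write their logs.
    scontrol show config | config_grep 'logfile'
fi
```

Swap the pattern for whatever you are hunting for, such as 'port' or 'timeout'.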

sinfo -N -l

This is handy for checking the state of each node.
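On a bigger cluster you may only want the nodes that need attention. Here is a minimal sketch, assuming a node counts as unhealthy when its state starts with "down" or "drain". The helper name unhealthy_nodes is invented for this example.

```shell
#!/bin/sh
# Hypothetical helper: print the names of nodes whose state begins with
# "down" or "drain". Expects "<nodename> <state>" lines on stdin.
unhealthy_nodes() {
    # sort -u removes repeats, since a node can be listed once per partition.
    awk '$2 ~ /^(down|drain)/ { print $1 }' | sort -u
}

# Only query Slurm when sinfo is actually installed on this machine.
if command -v sinfo >/dev/null 2>&1; then
    # -N: one line per node, -h: no header, %N: node name, %T: node state
    sinfo -N -h -o '%N %T' | unhealthy_nodes
fi
```

No output means every node is in some other state, such as idle or allocated.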

squeue

Shows the job queue. Use it to find your JOBIDs and see how your job is moving up the line.
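If the queue is long, you can narrow it down to your own jobs. The sketch below lists the IDs of your jobs that are still waiting. pending_job_ids is a made-up helper name, and it assumes squeue reports the waiting state as PENDING when asked for the long state format.

```shell
#!/bin/sh
# Hypothetical helper: print the IDs of jobs that are still waiting to run.
# Expects "<jobid> <state>" lines on stdin.
pending_job_ids() {
    awk '$2 == "PENDING" { print $1 }'
}

# Only query Slurm when squeue is actually installed on this machine.
if command -v squeue >/dev/null 2>&1; then
    # -u: only this user's jobs, -h: no header, %i: job ID, %T: job state
    squeue -u "${USER:-$(id -un)}" -h -o '%i %T' | pending_job_ids
fi
```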

scancel 45

Use this command to cancel a job that you no longer want or that you need to start over. “45” is the JOBID.

scontrol update nodename=node02 state=resume

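If you see that a node is down or drained, use this command to return it to service. It doesn’t reboot the Pi; it only tells Slurm that the node can accept jobs again. This example resumes “node02”. Adjust accordingly.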
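When several nodes drop out at once, typing that command for each one gets old fast. Here is a cautious sketch that takes node names on stdin and, by default, only prints the commands it would run. The function name resume_nodes and the RUN variable are both invented for this example, and actually running the update usually needs admin rights.

```shell
#!/bin/sh
# Hypothetical helper: build one resume command per node name read from stdin.
# Dry run by default; set RUN=1 (made-up variable) to really execute them.
resume_nodes() {
    while read -r node; do
        # Skip blank lines in the input.
        [ -n "$node" ] || continue
        if [ "${RUN:-0}" = "1" ]; then
            scontrol update nodename="$node" state=resume
        else
            echo "scontrol update nodename=$node state=resume"
        fi
    done
}

# Dry-run example with two placeholder node names.
printf 'node02\nnode03\n' | resume_nodes
```

Check the printed commands first, then rerun with RUN=1 once you are happy with the list.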

service --status-all

This isn’t really a Slurm command, per se, but it’s handy for checking that Slurm is actually running. If it’s not, don’t overlook the logs:

more /var/log/slurm/slurm.log
more /var/log/slurm/slurmctld.log
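Paging through a whole log can be slow on a busy cluster. As a rough sketch, the helper below shows only the most recent lines that mention an error. recent_errors is a hypothetical name, and matching on "error" or "fatal" is an assumption about how the problems you care about are worded.

```shell
#!/bin/sh
# Hypothetical helper: show the last few log lines mentioning "error" or "fatal".
# $1 is the log file, $2 the maximum number of lines to show (default 20).
recent_errors() {
    log_file=$1
    max_lines=${2:-20}
    if [ -r "$log_file" ]; then
        grep -i -E 'error|fatal' "$log_file" | tail -n "$max_lines"
    else
        echo "cannot read $log_file" >&2
        return 1
    fi
}

# Only look at the controller log if it exists and is readable here.
if [ -r /var/log/slurm/slurmctld.log ]; then
    recent_errors /var/log/slurm/slurmctld.log
fi
```

Pass a number as the second argument to see more or fewer lines, for example recent_errors /var/log/slurm/slurmctld.log 50.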
