Some Slurm Commands
scontrol show partition
Use this command to check the configuration of your partitions. It reports things like the nodes assigned to each partition, the total number of CPUs, and the time limits, so you can confirm that the configuration you put in place works the way you planned.
scontrol show config
If you need a little more detail, this command prints the configuration settings as they are currently in place on the cluster. It is a nice dump of data to pipe through grep.
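As a rough sketch of the grep idea, the snippet below wraps the filter in a small helper. The function name `filter_config` and the `timeout` pattern are illustrative choices, not part of Slurm. The helper reads from standard input, so it can be tried on any `key = value` text.

```shell
# filter_config is a hypothetical helper name, not a Slurm command.
# It keeps only the lines on stdin that match PATTERN, ignoring case.
filter_config() {
    grep -i -- "$1"
}

# On a machine with Slurm installed, show every timeout-related setting.
# The guard lets the snippet run harmlessly where scontrol is absent.
if command -v scontrol >/dev/null 2>&1; then
    scontrol show config | filter_config 'timeout' || echo "no matching settings"
fi
```

Swap `'timeout'` for whatever setting you are hunting for, for example `'port'` or `'log'`.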
sinfo -N -l
This is handy to check the state of each of the nodes.
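If you only care about the nodes that need attention, something like the following may help. It is a minimal sketch: `unavailable_nodes` is a made-up helper name, and it assumes that "needs attention" means a state beginning with `down` or `drain`. The `sinfo` options used are `-N` (one line per node), `-h` (no header), and `-o` (custom output format).

```shell
# unavailable_nodes is a hypothetical helper name.
# Input: "NODENAME STATE" pairs on stdin, one per line.
# Output: unique names of nodes whose state starts with "down" or "drain".
unavailable_nodes() {
    awk '$2 ~ /^(down|drain)/ { print $1 }' | sort -u
}

# -N lists one line per node, -h drops the header,
# -o '%N %T' prints the node name and its state in long form.
if command -v sinfo >/dev/null 2>&1; then
    sinfo -N -h -o '%N %T' | unavailable_nodes
fi
```

The `sort -u` is there because a node that belongs to several partitions shows up once per partition in node-oriented output.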
squeue
Show the queue. Find your JOBIDs. See how your job is moving up in the line.
scancel 45
Use this command to cancel a job that you don’t want or that you need to start over. “45” is the JOBID.
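Since cancelling the wrong job is annoying, a thin wrapper that refuses anything that is not a plain number can be a small safety net. This is only a sketch: `cancel_job` is a hypothetical function, and `SCANCEL` is an assumed override variable that exists purely so the wrapper can be tried without a cluster. By default it calls `scancel`.

```shell
# cancel_job is a hypothetical wrapper; SCANCEL is an assumed override
# variable (defaults to the real scancel command).
cancel_job() {
    case "$1" in
        ''|*[!0-9]*)
            # Empty argument, or something other than digits.
            echo "usage: cancel_job JOBID (digits only)" >&2
            return 2
            ;;
    esac
    "${SCANCEL:-scancel}" "$1"
}
```

Call it as `cancel_job 45`. Note that the digits-only check is deliberately strict and will reject array task IDs such as `45_3`, which `scancel` itself accepts.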
scontrol update nodename=node02 state=resume
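If you see that a node is down or drained, use this command to return it to service. It does not reboot the machine; it only clears the state in Slurm so the node can accept jobs again. This example resumes “node02”. Adjust accordingly.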
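When several nodes need resuming, it can be safer to generate the commands first and read them over before running anything. The sketch below does that. `resume_commands` is a hypothetical helper, and `node05` is just an example name alongside the `node02` used above.

```shell
# resume_commands is a hypothetical helper name.
# Input: node names on stdin, one per line (blank lines are skipped).
# Output: one "scontrol update" command per node, printed but not executed.
resume_commands() {
    while IFS= read -r node; do
        [ -n "$node" ] || continue
        printf 'scontrol update nodename=%s state=resume\n' "$node"
    done
}

# Dry run: print the commands for two example nodes.
printf 'node02\nnode05\n' | resume_commands
```

Once the list looks right, the output can be piped to `sh` to actually run it. Changing node state with `scontrol update` normally requires administrator privileges.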
This isn’t really a Slurm command, per se, but it is handy for checking that Slurm is actually running. If it’s not, don’t overlook the logs:
more /var/log/slurm/slurm.log
more /var/log/slurm/slurmctld.log
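Paging through a long log with `more` gets tedious, so here is a hedged sketch that pulls out only the most recent lines mentioning an error. `recent_errors` is a hypothetical helper. The log paths are the ones listed above and may differ on your cluster, since the location is set in the Slurm configuration.

```shell
# recent_errors is a hypothetical helper name.
# Usage: recent_errors LOGFILE [COUNT]
# Prints the last COUNT lines (default 20) that mention "error", ignoring case.
recent_errors() {
    if [ ! -r "$1" ]; then
        echo "cannot read $1" >&2
        return 1
    fi
    grep -i 'error' "$1" | tail -n "${2:-20}"
}
```

For example, `recent_errors /var/log/slurm/slurmctld.log 5` would show the last five matching lines. Reading the log may require root, depending on file permissions.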