Job control and useful commands
Below are some handy commands for controlling and checking up on running or queued jobs.
Overall cluster status
This will normally show colored bars for each partition, which unfortunately don't render here.
Cluster allocation summary per partition or individual nodes (-n).
(Numbers are reported in free/allocated/total(OS factor)).
Partition | CPUs | Memory (GB) | GPUs |
========================================================================================================
shared | 1436 196 /1632 (3x) | 2091 268 /2359 |
general | 395 373 /768 | 2970 731 /3701 |
high-mem | 233 199 /432 | 1803 1936 /3739 |
gpu | 24 40 /64 | 29 180 /209 | 1 1 /2
--------------------------------------------------------------------------------------------------------
Total: | 2088 808 /2896 | 6894 3115 /10009 | 1 1 /2
Jobs running/pending/total:
26 / 1 / 27
Use sinfo or squeue to obtain more details.
Get job status info
Use squeue, for example:
$ squeue
JOBID NAME USER ACCOUNT TIME TIME_LEFT CPU MIN_ME ST PRIO PARTITION NODELIST(REASON)
1275175 RStudioServer user01@bio acc1 0:00 3-00:00:00 32 5G PD 4 general (QOSMaxCpuPerUserLimit)
1275180 sshdbridge user02@bio acc2 7:14 7:52:46 8 40G R 6 general bio-oscloud03
1275170 VirtualDesktop user03@bio acc2 35:54 5:24:06 2 10G R 6 general bio-oscloud05
To show only your own jobs use squeue --me. This is used quite often, so sq has been made an alias of squeue --me. You can also append options such as --partition, --nodelist, and --reservation to only show the queue for selected partitions, nodes, or reservations.
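For example, to show only your own jobs in the general partition:
$ sq --partition=general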
You can also get an estimated start time for pending jobs by using squeue --start. Jobs will in most cases start earlier than this, as the estimate is based on the time limits of currently running jobs, and most jobs finish before their time limit. Jobs can only start later than estimated if new jobs with a higher priority are submitted to the queue before they start:
$ squeue --start
JOBID PARTITION NAME USER ST START_TIME NODES SCHEDNODES NODELIST(REASON)
1306529 general smk-mapp user01@b PD 2025-01-24T12:25:15 1 bio-oscloud02 (Resources)
1306530 general smk-mapp user01@b PD 2025-01-24T12:25:15 1 bio-oscloud04 (Priority)
1309386 general cq_EV_me user02@b PD 2025-01-24T12:25:15 1 bio-oscloud05 (Priority)
1306531 general smk-mapp user01@b PD 2025-01-24T12:27:02 1 bio-oscloud03 (Priority)
1306532 general smk-mapp user01@b PD 2025-01-25T14:46:56 1 (null) (Priority)
1306533 general smk-mapp user01@b PD 2025-01-25T14:46:56 1 (null) (Priority)
...
Job state codes (ST)
Status Code | Explanation |
---|---|
CD (COMPLETED) | Job has terminated all processes on all nodes with an exit code of zero. |
CG (COMPLETING) | Job is in the process of completing. |
F (FAILED) | Job terminated with a non-zero exit code or another failure condition. |
PD (PENDING) | Job is awaiting resource allocation. |
PR (PREEMPTED) | Job terminated due to preemption. |
R (RUNNING) | Job currently has an allocation and is running. |
S (SUSPENDED) | Job has an allocation, but execution has been suspended and its CPUs have been released for other jobs. |
ST (STOPPED) | Job has an allocation, but execution has been stopped with a SIGSTOP signal; CPUs are retained by the job. |
A complete list can be found in SLURM's documentation.
Job reason codes (REASON)
Reason Code | Explanation |
---|---|
Priority | One or more jobs with higher priority are ahead of yours in the queue. Your job will eventually run. |
Dependency | This job is waiting for a dependent job to complete and will run afterwards. |
Resources | The job is waiting for resources to become available and will eventually run. |
InvalidAccount | The job's account is invalid. Cancel the job and resubmit it with a correct account. |
InvalidQoS | The job's QoS is invalid. Cancel the job and resubmit it with a correct QoS. |
QOSGrpCpuLimit | All CPUs assigned to your job's specified QoS are in use; the job will run eventually. |
QOSGrpMaxJobsLimit | The maximum number of jobs for your job's QoS has been reached; the job will run eventually. |
QOSGrpNodeLimit | All nodes assigned to your job's specified QoS are in use; the job will run eventually. |
PartitionCpuLimit | All CPUs assigned to your job's specified partition are in use; the job will run eventually. |
PartitionMaxJobsLimit | The maximum number of jobs for your job's partition has been reached; the job will run eventually. |
PartitionNodeLimit | All nodes assigned to your job's specified partition are in use; the job will run eventually. |
AssociationCpuLimit | All CPUs assigned to your job's specified association are in use; the job will run eventually. |
AssociationMaxJobsLimit | The maximum number of jobs for your job's association has been reached; the job will run eventually. |
AssociationNodeLimit | All nodes assigned to your job's specified association are in use; the job will run eventually. |
A complete list can be found in SLURM's documentation.
The columns to show can be customized using the --format option, but can also be set with the environment variable SQUEUE_FORMAT to avoid typing it every time. You can always override this to suit your needs in your .bashrc file. The default format is currently:
See a full list here.
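If you want a different layout, a sketch of an override in your ~/.bashrc could look like this (the exact fields and widths are only an example, not necessarily the cluster's default):
export SQUEUE_FORMAT="%.10i %.14j %.11u %.9a %.10M %.12L %.4C %.8m %.3t %.5Q %.10P %R"
The fields roughly correspond to the columns in the squeue examples above: job ID, name, user, account, time used, time left, CPUs, minimum memory, state, priority, partition, and nodelist/reason.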
Prevent pending job from starting
Pending jobs can be put in a "hold" state to prevent them from starting.
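This can be done with scontrol, for example:
$ scontrol hold <jobid>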
To release a queued job from the 'hold' or 'requeued held' states:
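$ scontrol release <jobid>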
To cancel and rerun (requeue) a particular job:
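$ scontrol requeue <jobid>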
Cancel a job
With sbatch you won't be able to just hit CTRL+C to stop what's running like you're used to in a terminal. Instead you must use scancel. Get the job ID from squeue --me, then use scancel to cancel the job, for example:
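$ scancel <jobid>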
If the particular job doesn't stop or respond, consider using skill instead.
Pause or resume a job
Use scontrol to control your own jobs, for example to suspend a running job:
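$ scontrol suspend <jobid>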
Resume it again with:
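$ scontrol resume <jobid>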
Show details about a running or queued job
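You can use scontrol show, for example:
$ scontrol show job <jobid>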
If needed, you can also obtain the batch script used to submit a job:
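$ scontrol write batch_script <jobid>
By default this writes the script to a file named slurm-<jobid>.sh in the current directory.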
Modifying job attributes
Only a few job attributes can be changed after a job has been submitted, and only while it is NOT yet running. These attributes include:
- time limit
- job name
- job dependency
- partition or QOS
- nice value
For example:
$ scontrol update JobId=<jobid> timelimit=<new timelimit>
$ scontrol update JobId=<jobid> partition=high-mem
If the job is already running, adjusting the time limit must be done by an administrator.