Job control and useful commands
Below are some nice-to-know commands for controlling and checking up on jobs, current and past.
Get job status info
Use squeue, for example:
$ squeue
JOBID NAME USER TIME TIME_LEFT CPU MIN_ME ST PARTITION NODELIST(REASON)
2380 dRep ab12cd@bio 1-01:36:22 12-22:23:38 80 300G R general bio-oscloud02
To show only your own jobs use squeue --me. This is used quite often, so sq has been made an alias of squeue --me. You can, for example, also append --partition, --nodelist, --reservation, and more to only show the queue for the selected partitions, nodes, or reservations.
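For example, to show only your own jobs in a specific partition:
$ squeue --me --partition=<partition>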
Job state codes (ST)
Status | Code |
---|---|
COMPLETED | CD |
COMPLETING | CG |
FAILED | F |
PENDING | PD |
PREEMPTED | PR |
RUNNING | R |
SUSPENDED | S |
STOPPED | ST |
A complete list of job state codes can be found in SLURM's documentation.
Job reason codes (REASON)
Reason Code | Explanation |
---|---|
Priority | One or more higher-priority jobs are in the queue. Your job will eventually run. |
Dependency | This job is waiting for a job it depends on to complete and will run afterwards. |
Resources | The job is waiting for resources to become available and will eventually run. |
InvalidAccount | The job's account is invalid. Cancel the job and rerun with the correct account. |
InvalidQOS | The job's QoS is invalid. Cancel the job and rerun with the correct QoS. |
QOSGrpCpuLimit | All CPUs assigned to your job’s specified QoS are in use; job will run eventually. |
QOSGrpMaxJobsLimit | Maximum number of jobs for your job’s QoS have been met; job will run eventually. |
QOSGrpNodeLimit | All nodes assigned to your job’s specified QoS are in use; job will run eventually. |
PartitionCpuLimit | All CPUs assigned to your job’s specified partition are in use; job will run eventually. |
PartitionMaxJobsLimit | Maximum number of jobs for your job’s partition have been met; job will run eventually. |
PartitionNodeLimit | All nodes assigned to your job’s specified partition are in use; job will run eventually. |
AssociationCpuLimit | All CPUs assigned to your job’s specified association are in use; job will run eventually. |
AssociationMaxJobsLimit | Maximum number of jobs for your job’s association have been met; job will run eventually. |
AssociationNodeLimit | All nodes assigned to your job’s specified association are in use; job will run eventually. |
A complete list of reason codes can be found in SLURM's documentation.
The columns to show can be customized using the --format option, but to avoid typing it every time the format can also be set with the SQUEUE_FORMAT environment variable. You can always override this to suit your needs in your .bashrc file. The default format is currently:
See a full list of format options here.
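As an illustration of the syntax only (the fields below are squeue's stock defaults, not necessarily this cluster's), a custom format could be set in your .bashrc like this:
export SQUEUE_FORMAT="%.18i %.9P %.8j %.8u %.2t %.10M %.6D %R"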
Prevent pending job from starting
Pending jobs can be put in a "hold" state to prevent them from starting, for example:
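$ scontrol hold <jobid>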
To release a queued job from the ‘hold’ or 'requeued held' states:
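$ scontrol release <jobid>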
To cancel and rerun (requeue) a particular job:
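$ scontrol requeue <jobid>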
Cancel a job
With sbatch you won't be able to just hit CTRL+C to stop what's running like you're used to in a terminal. Instead you must use scancel. Get the job ID from squeue --me, then use scancel to cancel a running job, for example:
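$ scancel <jobid>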
If the job doesn't stop or respond, consider using skill instead.
Pause or resume a job
Use scontrol to control your own jobs, for example to suspend a running job:
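$ scontrol suspend <jobid>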
Resume it again with:
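$ scontrol resume <jobid>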
Show details about a running or queued job
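You can use scontrol show job with the job ID from squeue --me, for example:
$ scontrol show job <jobid>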
Modifying job attributes
Only a few job attributes can be changed after a job has been submitted and is not yet running. These attributes include:
- wall clock limit
- job name
- job dependency
- partition or QOS
For example:
$ scontrol update JobId=<jobid> timelimit=<new timelimit>
$ scontrol update JobId=<jobid> partition=high-mem
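Similarly, the job name or dependency can be updated, for example:
$ scontrol update JobId=<jobid> JobName=<new name>
$ scontrol update JobId=<jobid> Dependency=afterok:<other jobid>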
If the job is already running, adjusting the time limit must be done by an administrator.