
Optimizing CPU efficiency

How many resources should I request for my job(s)?

Exactly how many resources your job(s) need is something you have to learn over time from experience. It's important to do a bit of experimentation before submitting large jobs so you can make a qualified guess, since how well the allocated resources across the cluster are utilized ultimately depends on people's own assessments alone. Below are some tips regarding CPU and memory, but in the end please only request what you need, and no more. This is essential for optimizing the resource utilization and efficiency of the entire cluster.

Always inspect and optimize efficiency for next time!

When your job completes or fails, ALWAYS inspect the CPU and memory usage of the job, either in the notification email you receive or using these commands, and adjust the next job accordingly! This is essential to avoid wasting resources which other people could have used.

CPUs/threads

In general, the number of CPUs you book only affects how long the job will take to finish and how many jobs you can run concurrently. The main consideration is how many CPUs you want to spend on a particular job out of your max limit (see Usage accounting for how to see the current limits). If you use all CPUs for one job, you can't start more jobs until the first one has finished; the choice is yours. Regardless, it's very important to ensure that your jobs actually fully utilize the allocated number of CPUs, so don't start a job with 20 allocated CPUs if you only set the max threads for a certain tool to 10, for example. Utilization also depends very much on the specific software tools you use for the individual steps in a workflow and how they are implemented, so you are not always in full control of it. Furthermore, if you run a workflow with many different steps each using different tools, they will likely not use resources in the same way, and some may not support multithreading at all (like R, depending on the packages used) and thus only run as a single-threaded process. In that case it might be a good idea to either split the job into multiple jobs if the steps run for a long time, or use workflow tools that support cluster execution, for example Snakemake, where you can define separate resource requirements for individual steps. The same applies to memory usage.
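
If you do split a long workflow into separate jobs, SLURM job dependencies are one way to chain them while giving each step its own resource request. A minimal sketch, assuming a hypothetical two-step workflow (the script names, CPU counts, and memory sizes are placeholders, not recommendations):

```bash
# Hypothetical example: two workflow steps submitted as separate jobs,
# each with its own CPU/memory request.

# Step 1: a multithreaded step with many CPUs and moderate memory.
step1=$(sbatch --parsable --cpus-per-task=20 --mem=40G map_reads.sh)

# Step 2: a single-threaded step with more memory, started only after
# step 1 has finished successfully.
sbatch --dependency=afterok:"$step1" --cpus-per-task=1 --mem=60G analyse.sh
```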

Overprovisioning

Sometimes there is just no way around it, and if you don't expect your job(s) to be very efficient, please submit to the overprovisioned default-op partition, which is also the default. Overprovisioning simply means that SLURM allocates more CPUs than are physically available on each machine, so that more than one job can run on each CPU. This ensures that each physical CPU is actually utilized 100%, and thus more people are happy!
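
As a small example, submitting to the overprovisioned partition only requires pointing the job at it (the script name is a placeholder, and since default-op is the default, the flag can normally be omitted):

```bash
# Submit an inefficient job to the overprovisioned partition.
# default-op is also the default, so --partition is strictly optional here.
sbatch --partition=default-op my_job.sh
```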

Use nproc everywhere

The number of CPUs is not a hard limit in the way the physical amount of memory is. SLURM will never exceed the maximum physical memory of each compute node; instead, jobs are killed if they exceed their allocated amount of memory (though only if no other jobs need the memory), or are not allowed to start in the first place. With CPUs, the processes you run simply can't detect more CPUs than those allocated, so it's handy to just use nproc within scripts to detect the number of available CPUs instead of manually setting a value for each tool. Furthermore, if you request more memory per CPU than the max allowed for the particular partition (refer to hardware partitions), SLURM will automatically allocate more CPUs for the job, and hence, again, it's a good idea to detect the number of CPUs dynamically using nproc everywhere.
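
A minimal sketch of a batch script using nproc, assuming a hypothetical tool that takes a --threads option (the tool, its options, and the resource numbers are placeholders):

```bash
#!/usr/bin/env bash
#SBATCH --cpus-per-task=10
#SBATCH --mem=20G

# nproc only reports the CPUs actually allocated to the job, so the
# thread count always matches the allocation without hard-coding it.
threads=$(nproc)

# Hypothetical tool invocation - replace with your own tool and options.
some_tool --threads "$threads" --input data.fastq --output results/
```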

Memory

Requesting a sensible maximum amount of memory is important to avoid crashing jobs. It's generally best to allocate a bit more memory than you expect to need, so that the job doesn't crash and the resources already spent don't go to waste (they could have been used for something else). To obtain a qualified guess, start the job based on an initial expectation, but with a job time limit of maybe 5-10 minutes, just to see whether it crashes due to exceeding the allocated memory. If it doesn't, you will see the maximum memory usage of the job in the email notification report (or use seff <jobid>). Then adjust accordingly and submit again with a little extra on top of the observed maximum. Different steps of a workflow will in many cases, unavoidably, need more memory than others, so again: if they run for a long time, split them into multiple jobs or use Snakemake.
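
A sketch of that trial-run approach, with made-up numbers and a placeholder script name:

```bash
# Trial run: initial memory guess and a short time limit, just to see
# whether the job crashes on the memory-hungry part (numbers are examples).
jobid=$(sbatch --parsable --time=00:10:00 --cpus-per-task=10 --mem=50G my_job.sh)

# After the job has ended, inspect the actual CPU and memory efficiency.
seff "$jobid"

# Resubmit with a realistic time limit and a little extra memory on top
# of the observed maximum (here pretending seff reported ~30 GB peak).
sbatch --time=12:00:00 --cpus-per-task=10 --mem=35G my_job.sh
```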

Our compute nodes have plenty of memory, but some tools require a lot of it. If you know that your job is going to use a lot of memory (per CPU, that is), you should likely submit the job to the high-mem partition. In order to fully utilize each compute node, a general rule of thumb is to:

Rule of thumb for optimal memory usage

Request a maximum of 5 GB per CPU (1 TB / 192 threads) if submitting jobs to the general partition. If you need more than that, submit to the high-mem partition instead. If you request more memory per CPU than allowed for a particular partition, SLURM will automatically allocate more CPUs to scale accordingly, details here. It is therefore ideal to detect the number of CPUs available dynamically in your workflows, for example using nproc.
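
As an illustration of the arithmetic (the numbers, script name, and the partition name written here as general are examples only; 100 GB / 5 GB per CPU = 20 CPUs):

```bash
# Example numbers only: ~100 GB total memory stays within the 5 GB per
# CPU rule of thumb when paired with 20 CPUs. The partition name is a
# placeholder - use the name from the hardware partitions page.
sbatch --partition=general --cpus-per-task=20 --mem=100G my_job.sh

# Alternatively, request memory per CPU directly; requesting more than
# the partition's per-CPU maximum makes SLURM allocate extra CPUs.
sbatch --partition=general --cpus-per-task=20 --mem-per-cpu=5G my_job.sh
```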

If you know that you are going to almost fully saturate the memory on a compute node (depending on the partition), you might as well also request more CPUs, up to the total of a single compute node, since your job will likely occupy a full compute node alone anyway; otherwise CPUs end up idle while you could have finished the job faster. If needed, you can also submit directly to specific compute nodes using the nodelist option (and potentially also --exclusive); refer to hardware partitions for hostnames and compute node specs.
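
A minimal sketch of targeting a specific node (the hostname is a placeholder; the script name is too):

```bash
# Placeholder hostname - see the hardware partitions page for real names.
# --exclusive reserves the whole node so no other jobs share it.
sbatch --nodelist=<hostname> --exclusive my_job.sh
```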

Also keep in mind that the effective amount of memory available to SLURM jobs is less than what the physical machines have, because they are virtual machines running on a hypervisor OS that also needs some memory. A 1 TB machine has roughly 950 GB available, and the 2 TB ones have about 1.9 TB. See sinfo -N -l for details of each node.