Terra:Batch Queues
Queues
Upon job submission, Slurm sends your jobs to the appropriate batch queues. These are (software) service stations configured to control the scheduling and dispatch of the jobs that arrive in them. Batch queues are characterized by a number of parameters. Some of the most important are:
- The total number of jobs that can be concurrently running (number of run slots)
- The wall-clock time limit per job
- The type and number of nodes it can dispatch jobs to
These settings control whether a job will remain idle in the queue or be dispatched quickly for execution.
The current queue structure (updated on January 29, 2020) is:
Queue | Job Max Cores / Nodes | Job Max Walltime | Compute Node Types | Per-User Limits Across Queues | Notes |
---|---|---|---|---|---|
short | 448 cores / 16 nodes | 30 min / 2 hr | 64 GB nodes (256) | 1800 Cores per User | |
medium | 1792 cores / 64 nodes | 1 day | 64 GB nodes (256) | 1800 Cores per User | |
long | 896 cores / 32 nodes | 7 days | 64 GB nodes (256) | 1800 Cores per User | |
xlong | 448 cores / 16 nodes | 21 days | 64 GB nodes (256) | 448 Cores per User | For jobs needing to run longer than 7 days. Submit jobs to this partition with the --partition xlong option (see the example script below). |
gpu | 1344 cores / 48 nodes | 3 days | 128 GB nodes with GPUs (48) | | For jobs requiring a GPU or more than 64 GB of memory. |
vnc | 28 cores / 1 node | 12 hours | 128 GB nodes with GPUs (48) | | For jobs requiring remote visualization. |
knl | 68 cores / 8 nodes or 72 cores / 8 nodes | 7 days | 96 GB nodes with KNL processors (8) | | For jobs requiring a KNL. |
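For reference, the queue is chosen inside the batch file itself. The sketch below targets the xlong partition mentioned in the table; the job name, task count, and executable (my_program) are placeholders rather than recommended settings.

    #!/bin/bash
    #SBATCH --job-name=long_run       # placeholder job name
    #SBATCH --time=14-00:00:00        # 14 days, within the 21-day xlong limit
    #SBATCH --ntasks=28               # placeholder: one full 28-core node
    #SBATCH --partition=xlong         # required to route the job to xlong

    ./my_program                      # placeholder executable

The script is then submitted with sbatch as usual.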
Checking queue usage
The following command can be used to get information on queues and their nodes.
[NetID@terra1 ~]$ sinfo
Example output:
PARTITION  AVAIL  TIMELIMIT  JOB_SIZE  NODES(A/I/O/T)  CPUS(A/I/O/T)
short*     up     2:00:00    1-16      244/12/0/256    5333/1835/0/7168
Note: A/I/O/T stands for Allocated, Idle, Other, and Total
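sinfo also accepts a partition filter, which is convenient when you only care about one queue, for example:

[NetID@terra1 ~]$ sinfo --partition=gpu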
Checking node usage
The following command can be used to generate a list of nodes and their corresponding information, including their CPU usage.
[NetID@terra1 ~]$ pestat
Example output:
Hostname   Partition    Node   Num_CPU  CPUload  Memsize  Freemem  Joblist
                        State  Use/Tot           (MB)     (MB)     JobId User ...
knl-0101   knl          drain$   0  68    0.00*  88000    0
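The pestat installed on Terra appears to be the commonly used pestat script for Slurm; on that assumption, its output can usually be restricted to a single partition with the -p option (run pestat -h to confirm which options are available on Terra):

[NetID@terra1 ~]$ pestat -p knl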
Checking bad nodes
The following command can be used to view a current list of bad nodes on the machine:
[NetID@terra1 ~]$ bad_nodes.sh
The following output is just an example; run bad_nodes.sh to see the current list.
Example output:
% bad_nodes.sh
REASON                                                 USER     TIMESTAMP            STATE     NODELIST
The system board OCP1 PG voltage is outside of range.  root     2022-07-11T14:38:07  drained   fc152
FPGA preparation in progress                           root     2022-07-12T15:57:01  drained*  fc[125-126]
investigating memverge license issue                   francis  2022-08-09T14:15:05  drained   fc032
investigating unknown memverge issue                   francis  2022-08-09T14:15:19  drained   fc033
fabric 1 hardware failure                              francis  2022-08-15T13:52:10  drained*  fc[001-006,008,039-040]
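bad_nodes.sh is a site-provided convenience script. If it is unavailable, standard Slurm can produce a similar list of down, drained, or failing nodes along with the recorded reason:

[NetID@terra1 ~]$ sinfo --list-reasons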
Checkpointing
Checkpointing is the practice of creating a save state of a job so that, if interrupted, it can begin again without starting completely over. This technique is especially important for long jobs on the batch systems, because each batch queue has a maximum walltime limit.
A checkpointed job file is particularly useful for the gpu queue, which is limited to 3 days of walltime due to its demand. There are many cases of jobs that require the use of GPUs and must run longer than that limit, such as training a machine learning model.
Users can change their code to implement save states so that their jobs can restart automatically when cut off by the walltime limit. There are many different ways to checkpoint a job depending on the software used, but it is almost always done at the application level. How frequently save states are made is up to the user and depends on the fault tolerance the job needs; in the case of the batch system, however, the exact time of the 'fault' is known in advance: it is the walltime limit of the queue. In that case, only one checkpoint needs to be created, just before the limit is reached. A generic sketch of this save-and-resubmit pattern is shown below, followed by checkpointing resources for some common software.
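The following is a minimal sketch of the pattern, independent of any particular application. It assumes a hypothetical program my_app that periodically writes its save state to checkpoint.dat, resumes from that file when it exists, and creates finished.flag once the full computation is complete; the script file name checkpoint_job.slurm is likewise a placeholder.

    #!/bin/bash
    #SBATCH --job-name=ckpt_demo      # placeholder job name
    #SBATCH --time=3-00:00:00         # request up to the gpu queue's 3-day limit
    #SBATCH --ntasks=28               # placeholder task count
    #SBATCH --partition=gpu
    #SBATCH --gres=gpu:1              # standard Slurm GPU request (adjust for your job)

    # my_app (hypothetical) resumes from checkpoint.dat if a save state exists.
    if [ -f checkpoint.dat ]; then
        ./my_app --resume checkpoint.dat
    else
        ./my_app
    fi

    # Not finished yet: submit a follow-up job that picks up from the save state.
    if [ ! -f finished.flag ]; then
        sbatch checkpoint_job.slurm
    fi

Each resubmitted job starts in the original submission directory, so it finds checkpoint.dat and continues from the latest save state; the chain ends once finished.flag appears.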