== Introduction ==

The batch system is a load distribution implementation that ensures convenient and fair use of a shared resource. Submitting jobs to a batch system allows a user to reserve specific resources with minimal interference to other users. All users are required to submit resource-intensive processing to the compute nodes through the batch system - <font color=red>attempting to circumvent the batch system is not allowed.</font>

The batch system, also commonly known as the '''job scheduler''', comprises software and structures (e.g., queues) through which jobs from all users are scheduled for execution. That is, jobs are submitted to the Ada cluster for execution in a sequence determined by a set of criteria, some of which are configured by the system administrators to fulfill policy goals, while the rest are specified by the user to satisfy various job execution requirements (e.g., number of cores, memory size per process).

More specifically, the job scheduler handles three key tasks:

* allocating the computing resources requested for a job,
* running the job, and
* reporting the outcome of the execution back to the user.

To run a job, a user must at a minimum carry out the following:

* prepare a job file (or job script), and
* submit the job file for execution.

On Ada, '''LSF''' (Load Sharing Facility) is the batch system that provides overall job management. Jobs written in other batch system formats must be translated to LSF in order to be used on Ada; the [[HPRC:Batch_Translation | Batch Translation Guide]] offers some assistance for translating between batch systems that TAMU HPRC has previously used.

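To give a first impression of what the two steps above look like in practice, here is a minimal sketch of an LSF job file. The job name, time limit, core count, memory value, and program name are illustrative placeholders, not recommended settings; the individual job parameters are discussed in detail in the sections that follow.

<pre>
#BSUB -J example_job        # job name (illustrative)
#BSUB -L /bin/bash          # use bash as the login shell for the job
#BSUB -W 2:00               # wall-clock time limit of 2 hours (hh:mm)
#BSUB -n 20                 # request 20 job slots (cores)
#BSUB -R "span[ptile=20]"   # place all 20 cores on one host (node)
#BSUB -M 2700               # memory limit, per process or per job depending on site configuration (placeholder value)
#BSUB -o output.%J          # standard output file; %J expands to the job ID
#BSUB -e error.%J           # standard error file

# The commands to execute go below the #BSUB directives, e.g.:
./my_program                # hypothetical executable
</pre>

The job file is then submitted with the '''bsub''' command, e.g., <code>bsub &lt; example_job.lsf</code> (the file name is arbitrary).
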
Below we take up the subject in more detail. Before we do that, we very briefly review the number of compute nodes of each type, along with the amount of memory and the number of GPUs or Phi coprocessors per node. The node types, as named by LSF, are shown in bold. You will need this information to form correct job files.

'''Compute nodes and their types available to LSF''' (see the sketch after this list for how a type name is used in a job file):

* 792 NeXtScale ('''nxt'''): 20-core with 64GB memory
* 26 iDataPlex ('''mem256gb'''): 20-core with 256GB memory
* 11 x3850x5 ('''mem1tb'''): 40-core with 1TB memory
* 4 x3850x5 ('''mem2tb'''): 40-core with 2TB memory
* 9 iDataPlex ('''phi'''): 20-core with 64GB memory and 2 Phi coprocessors
* 30 iDataPlex ('''gpu'''): 20-core with 2 GPUs and 64GB or 256GB memory
* 20 iDataPlex ('''gpu256gb'''): 20-core with 2 GPUs and 256GB memory; these are a subset of the 30 gpu nodes above

Under LSF's current configuration, the above node types describe the great majority of the nodes. Node names are mostly a concatenation of the type and a numeric identifier: nxt1238, nxt1454, lrg256-3002 (for mem256gb types), etc. Please refer to the [[Ada:Intro | Hardware Summary]] section for more information on node types.

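As a brief illustration, a job file can restrict a job to one of the node types above through a resource requirement string. The sketch below assumes the bold type names are defined as selectable resources in the local LSF configuration; the chosen type and core counts are arbitrary examples.

<pre>
#BSUB -n 20                            # request 20 cores
#BSUB -R "select[nxt] span[ptile=20]"  # nxt (64GB NeXtScale) nodes only, 20 cores per node (assumed resource name)
</pre>
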
'''Terminology.''' LSF uses the following terms (a short example after this list shows how they map onto a job request):

# '''host''' to refer, within Ada, to what is commonly called a node, that is, a shared-memory multi-core system;
# '''processor''' or '''cpu''' to refer to what in this guide we always call a '''core''';
# '''job slot''' to mean, in effect, an available core, so 3 available job slots means 3 available cores;
# '''task''' or '''mpi process''' to mean the same thing, an MPI process.

We use the term:

# '''iDataPlex node''' to mean a node that has one of the following: GPUs, Phi coprocessors, or 256 GB of memory. All such nodes are housed in a single iDataPlex rack.

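The following sketch ties these terms together for a hypothetical MPI job; the core counts and program name are illustrative only.

<pre>
#BSUB -n 40                  # 40 job slots, i.e., 40 cores
#BSUB -R "span[ptile=20]"    # 20 cores per host, so LSF allocates 2 hosts (nodes)

mpirun -np 40 ./mpi_app      # 40 tasks (MPI processes), one per job slot; hypothetical executable
</pre>
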
There are many LSF commands, and many of them have many options. Here we focus on the most common and most useful; a few examples are shown below. For details you can always consult the man pages and LSF's user manual: http://sc.tamu.edu/softwareDocs/lsf/9.1.2/

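A sampler of frequently used commands is given below; the job ID is a hypothetical placeholder.

<pre>
bsub < example_job.lsf    # submit the job file example_job.lsf
bjobs                     # list your pending and running jobs
bjobs -l 123456           # detailed information on job 123456
bpeek 123456              # peek at the output of a running job
bkill 123456              # kill (cancel) job 123456
bqueues                   # show the available queues
bhosts                    # show the status of the hosts (nodes)
</pre>
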
[[Category:Ada]]