== Introduction ==

The batch system is a load distribution implementation that ensures convenient and fair use of a shared resource. Submitting jobs to a batch system allows a user to reserve specific resources with minimal interference to other users. All users are required to submit resource-intensive processing to the compute nodes through the batch system - <font color=red> attempting to circumvent the batch system is not allowed.</font>
  
On Ada, '''LSF''' (Load Sharing Facility) is the batch system that provides job management. Jobs written in other batch system formats must be translated to LSF in order to be used on Ada. The [[HPRC:Batch_Translation | Batch Translation Guide]] offers some assistance for translating between batch systems that TAMU HPRC has previously used.

The batch system, also commonly known as the '''job scheduler''', comprises software and structures (e.g., queues) through which jobs from all users are scheduled for execution. That is, jobs are submitted to the Ada cluster for execution in a sequence determined by a set of criteria, some of which are configured by the system administrators to fulfill policy goals, while the rest are specified by the user to satisfy various job execution requirements (e.g., number of cores, memory size per process).
 
  
More specifically, the job scheduler handles three key tasks:
 
 
* allocating the computing resources requested for a job
* running the job, and
* reporting the outcome of the execution back to the user (see the sketch after this list)
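
As a rough sketch of how these three tasks look from the user's side (the job file name and job ID below are hypothetical, and exact messages vary):

<pre>
# Submission: bsub asks LSF to allocate the requested resources and replies
# with a job ID, e.g. "Job <12345> is submitted to default queue <...>".
bsub < MyJob.lsf

# While LSF allocates resources and runs the job, its state moves from
# PEND (waiting for resources) to RUN (executing); check it with:
bjobs 12345

# The outcome is reported back in the job's output file (named via the -o
# option in the job file) and, in summary form, by:
bjobs -l 12345
</pre>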
 
 
 
To run a job, a user must at a minimum carry out the following:
 
 
 
* prepare a job file (or job script), and
* submit the job file for execution (a minimal sketch follows this list)
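
For illustration only, a minimal job file might look as follows. The job name, file names, and resource amounts are arbitrary placeholders; the memory values are given in MB and treated as per-process amounts, which is an assumption about Ada's LSF configuration. The rest of this guide covers the options in detail.

<pre>
#BSUB -J MyJob               # job name (arbitrary)
#BSUB -L /bin/bash           # use bash to initialize the job's environment
#BSUB -n 20                  # request 20 job slots (cores)
#BSUB -R "span[ptile=20]"    # place all 20 slots on one host (node)
#BSUB -R "rusage[mem=2500]"  # reserve memory (MB, assumed per process)
#BSUB -M 2500                # per-process memory limit (MB assumed)
#BSUB -W 2:00                # wall-clock time limit (hh:mm)
#BSUB -o MyJob.%J.out        # output file; %J expands to the job ID

# The commands to execute go below the #BSUB directives, for example:
echo "Hello from $(hostname)"
</pre>

The job file is then handed to LSF with the '''bsub''' command; note the input redirection, which is required for the embedded #BSUB directives to be read:

<pre>
bsub < MyJob.lsf
</pre>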
 
 
 
Below we take up job files and LSF job management in more detail. Before we do that, we very briefly review the number of compute nodes of each type, along with the amount of memory and the number of GPUs or Phi coprocessors per node. The node types, as per LSF, are in bold. You will need this information to form correct job files.
 
 
 
'''Compute nodes and their types available to LSF:'''
 
 
 
* 792 NeXtScale ('''nxt'''): 20-core with 64GB memory
* 26 iDataPlex ('''mem256gb'''): 20-core with 256GB memory
* 11 x3850x5 ('''mem1tb'''): 40-core with 1TB memory
* 4 x3850x5 ('''mem2tb'''): 40-core with 2TB memory
* 9 iDataPlex ('''phi'''): 20-core with 64GB memory and 2 Phi coprocessors
* 30 iDataPlex ('''gpu'''): 20-core with 2 GPUs and 64GB or 256GB memory
* 20 iDataPlex ('''gpu256gb'''): 20-core with 2 GPUs and 256GB memory (a subset of the 30 gpu nodes above)
 
 
 
Under LSF's current configuration, the above node types cover the great majority of the cluster's nodes. Node names are mostly a concatenation of the type and a numeric identifier: nxt1238, nxt1454, lrg256-3002 (for mem256gb types), etc. Please refer to the [[Ada:Intro | Hardware Summary]] section for more information on node types.
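
As a sketch, and assuming the bold node-type names above are defined as LSF resources that can appear in a select[] requirement (which is what "as per LSF" indicates), a job file can be directed to a particular node type like this:

<pre>
# Use at most one of the following lines in a job file, matching the node type needed:
#BSUB -R "select[nxt]"        # 20-core, 64GB NeXtScale nodes
#BSUB -R "select[mem256gb]"   # 20-core, 256GB iDataPlex nodes
#BSUB -R "select[mem1tb]"     # 40-core, 1TB nodes
#BSUB -R "select[gpu]"        # GPU-equipped iDataPlex nodes
</pre>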
 
 
 
'''Terminology.''' LSF uses the terms:
# '''host''' to refer, within Ada, to what is commonly called a node, that is, a shared-memory multi-core system;
# '''processor''' or '''cpu''' to refer to what this guide always calls a '''core''';
# '''job slot''' to mean, in effect, an available core, so 3 available job slots means 3 available cores (see the sketch after this list);
# '''task''' or '''mpi process''' to mean the same thing, an MPI process.
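
The following illustrative directives (with arbitrary numbers) show how these terms fit together in a resource request:

<pre>
#BSUB -n 40                # request 40 job slots, i.e. 40 cores
#BSUB -R "span[ptile=20]"  # place 20 job slots (cores) on each host (node),
                           # so the 40 slots span 2 hosts; an MPI job would
                           # typically run 40 tasks (MPI processes), one per slot
</pre>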
 
 
 
We use the term:
# '''iDataPlex node''' to mean a node that has one of the following: GPUs, Phis, or 256GB of memory. All such nodes are housed in a single iDataPlex rack.
 
 
 
 
 
There are many LSF commands, and many of them have numerous options. Here we focus on the most common and most useful ones. For details, you can always consult the man pages and LSF's user manual: http://sc.tamu.edu/softwareDocs/lsf/9.1.2/
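
As a starting point, the standard LSF commands you are most likely to need are sketched below; each is described fully in its man page and in the manual linked above:

<pre>
bsub < jobfile.lsf   # submit a job file
bjobs                # list your jobs and their current state (PEND, RUN, ...)
bjobs -l <jobid>     # detailed information about one job
bpeek <jobid>        # peek at the output of a running job
bkill <jobid>        # cancel (kill) a job
bqueues              # list the available queues
bhosts               # show the hosts (nodes) and their job slot usage
bhist <jobid>        # show the history of a job
</pre>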
 

[[Category:Ada]]