
Ada:Batch Introduction

From TAMU HPRC
Revision as of 11:27, 3 February 2016 by Francis

Introduction

The batch system, also commonly known as the job scheduler, comprises software and structures (e.g., queues) through which jobs from all users are scheduled for execution. That is, jobs are submitted to the Ada cluster for execution in a sequence determined by a set of criteria, some of which are configured by the system administrators to fulfill policy goals, while the rest are specified by the user to satisfy various job execution requirements (e.g., number of cores, memory size per process).

More specifically, the job scheduler handles three key tasks:

  • allocating the computing resources requested for a job
  • running the job, and
  • reporting back to the user the outcome of the execution

To run a job, a user must at a minimum carry out the following:

  • prepare a job file (or job script), and
  • submit the job file for execution
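The two steps above can be sketched as follows. Below is a minimal, illustrative job file; the job name, wall-clock limit, and resource values are examples for the sketch, not site defaults. Lines beginning with #BSUB are directives read by LSF at submission time.

```shell
#BSUB -J mySampleJob         # job name (illustrative)
#BSUB -L /bin/bash           # use bash as the job's login shell
#BSUB -W 0:30                # wall-clock limit of 30 minutes (hh:mm)
#BSUB -n 20                  # request 20 cores (job slots)
#BSUB -R "rusage[mem=2500]"  # request 2500 MB of memory per core
#BSUB -o stdout.%J           # write stdout to stdout.<jobID>
#BSUB -e stderr.%J           # write stderr to stderr.<jobID>

# the commands to run go below the #BSUB directives
echo "Hello from $HOSTNAME"
```

The job file is then submitted by redirecting it into bsub, e.g., bsub < myjob.sh; note that LSF reads the job file from standard input.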

On Ada, LSF (Load Sharing Facility) is the batch system that provides for overall job management. Below we take up the subject in more detail. Before we do that, we very briefly review the number of compute nodes of each type along with the amount of memory and number of GPUs or Phi's per node. The node types, as per LSF, are in bold. You will need this information to form correct job files.

Compute nodes and their types available to LSF:

  • 792 NeXtScale (nxt): 20-core with 64GB memory
  • 26 iDataPlex (mem256gb): 20-core with 256GB memory
  • 11 x3850x5 (mem1tb): 40-core with 1TB memory
  • 4 x3850x5 (mem2tb): 40-core with 2TB memory
  • 9 iDataPlex (phi): 20-core with 64GB of memory and 2 Phi coprocessors
  • 30 iDataPlex (gpu): 20-core with 2 GPUs and 64GB or 256GB memory
  • 20 iDataPlex (gpu256gb): 20-core with 2 GPUs and 256GB memory. These are a portion of the above 30.

Under LSF's current configuration, the above node types account for the great majority of Ada's compute nodes. Node names are mostly a concatenation of the type and a numeric identifier: nxt1238, nxt1454, lrg256-3002 (for mem256gb types), etc. Please refer to the Hardware Summary section for more information on node types.
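When a job must run on a particular node type, the bold type keywords above can appear in the job file's resource selection string. A sketch, assuming the types are exposed to LSF as selectable resources (as the names above suggest); a job file would normally contain only one such select clause:

```shell
#BSUB -R "select[nxt]"       # run only on 64GB NeXtScale nodes
#BSUB -R "select[mem256gb]"  # ...or instead, only on 256GB iDataPlex nodes
#BSUB -R "select[gpu]"       # ...or instead, only on GPU nodes
```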

Terminology. LSF uses the terms:

  1. host to refer, within Ada, to what is commonly called a node, that is, a shared-memory multi-core system
  2. processor or cpu to refer to what we will always call a core in this guide
  3. job slot to mean, in effect, an available core; so, 3 available job slots refers to 3 available cores
  4. task and MPI process interchangeably to mean an MPI process

We use the term:

  1. iDataPlex node to mean a node that has one of the following: GPUs, Phi's, or 256 GB of memory. All such nodes are housed in a single iDataPlex rack.
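This terminology maps directly onto resource requests. For example, a hypothetical 40-task MPI job packed 20 tasks per host (i.e., spanning two 20-core nodes) could be requested as:

```shell
#BSUB -n 40                # 40 job slots, i.e., 40 cores, one per MPI task
#BSUB -R "span[ptile=20]"  # place 20 slots (cores) on each host (node)
```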


There are many LSF commands, most of which accept numerous options. Here we focus on the most common and most useful ones. For details, you can always consult the man pages and LSF's user manual: http://sc.tamu.edu/softwareDocs/lsf/9.1.2/
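As a preview, some of the most commonly used LSF commands are listed below; the job ID 123456 and the file name myjob.sh are placeholders. These commands require a cluster running LSF.

```shell
bsub < myjob.sh   # submit the job file myjob.sh
bjobs             # list your pending and running jobs
bjobs -l 123456   # show detailed information for job 123456
bpeek 123456      # view the stdout/stderr of a running job
bkill 123456      # kill job 123456
bqueues           # list the available queues and their status
bhosts            # show the status of hosts (nodes)
```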