Ada Batch Processing: LSF

Introduction

The batch system is a load distribution implementation that ensures convenient and fair use of a shared resource. Submitting jobs to a batch system allows a user to reserve specific resources with minimal interference to other users. All users are required to submit resource-intensive processing to the compute nodes through the batch system - attempting to circumvent the batch system is not allowed.

On Ada, LSF is the batch system that provides job management. Jobs written in other batch system formats must be translated to LSF in order to be used on Ada. The Batch Translation Guide offers some assistance for translating between batch systems that TAMU HPRC has previously used.

Building Job files

While job files are not the only way to submit programs for execution, they fulfill the needs of most users.

Job files consist of two main parts:

  • Resource Specifications
  • Executable commands

In a job file, resource specification options are preceded by a script directive. For each batch system, this directive is different. On Ada (LSF) this directive is #BSUB.
For every line of resource specifications, this directive must be the first text of the line, and all specifications must come before any executable lines. An example of a resource specification is given below:

#BSUB -J MyExample  #Set the job name to "MyExample"

Note: Comments in a job file also begin with a # but LSF recognizes #BSUB as a directive.

A list of the most commonly used and important options for these job files is given in the following section of this wiki. Full job file examples are given below.

Basic Job Specifications

Several of the most important options are described below. These basic options are typically all that is needed to run a job on Ada.

Basic Ada (LSF) Job Specifications
Specification                Option                  Example                 Example-Purpose
Job Name                     -J [SomeText]           -J MyJob1               Set the job name to "MyJob1".
Shell                        -L [Shell]              -L /bin/bash            Use the bash shell to initialize the job's execution environment.
Wall Clock Limit             -W [hh:mm]              -W 1:15                 Set the wall clock limit to 1 hr 15 min.
Core Count                   -n ##                   -n 20                   Assign 20 job slots/cores.
Cores Per Node               -R "span[ptile=##]"     -R "span[ptile=5]"      Request 5 cores per node.
Memory Per Core I            -M [MB]                 -M 2560                 Set the per-process memory limit to 2560 MB.
Memory Per Core II           -R "rusage[mem=[MB]]"   -R "rusage[mem=2560]"   Schedule the job on nodes that have at least 2560 MB available per core.
Combined stdout and stderr   -o [OutputName].%J      -o stdout1.%J           Collect stdout/stderr in stdout1.[JobID].
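
Put together, a job file that uses only these basic specifications might look like the following minimal sketch; the job name, resource values, and program name (myProgram) are placeholders to adapt to your own work:

#BSUB -J MyJob1              # Set the job name to "MyJob1"
#BSUB -L /bin/bash           # Initialize the job's environment with the bash shell
#BSUB -W 1:15                # Set the wall clock limit to 1 hr 15 min
#BSUB -n 20                  # Request 20 job slots/cores
#BSUB -R "span[ptile=20]"    # Request 20 cores per node (i.e., a single node)
#BSUB -R "rusage[mem=2560]"  # Schedule on nodes with at least 2560 MB available per core
#BSUB -M 2560                # Set the per-process memory limit to 2560 MB
#BSUB -o stdout1.%J          # Collect stdout/stderr in stdout1.[JobID]

./myProgram                  # Placeholder for your executable commands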

Optional Job Specifications

A variety of optional specifications are available to customize your job. The table below lists the specifications which are most useful for users of Ada.

Optional Ada (LSF) Job Specifications
Specification            Option                 Example              Example-Purpose
Set Allocation           -P ######              -P 274839            Set allocation to charge to 274839.
Email Notification I     -u [email-address]     -u howdy@tamu.edu    Send emails to howdy@tamu.edu.
Email Notification II    -[B|N]                 -B -N                Send email on beginning (-B) and end (-N) of job.
Specify Queue            -q [queue]             -q xlarge            Request only nodes in the xlarge subset.
Exclusive Node Usage     -x                     -x                   Assigns a whole node exclusively for the job.
Specific Node Type       -R "select[gpu|phi]"   -R "select[gpu]"     Requests a node with a GPU to be used for the job.
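
For instance, a few of these optional specifications could be appended to the basic ones above; the allocation number and email address are the placeholder values from the table:

#BSUB -P 274839              # Charge the job to allocation 274839
#BSUB -u howdy@tamu.edu      # Send job notification emails to howdy@tamu.edu
#BSUB -B -N                  # Email at the beginning (-B) and end (-N) of the job
#BSUB -x                     # Request exclusive use of the allocated nodes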

Environment Variables

All the nodes enlisted for the execution of a job carry most of the environment variables the login process created: HOME, SCRATCH, PWD, PATH, USER, etc. In addition, LSF defines new ones in the environment of an executing job. Below is a list of the most commonly used environment variables.

Basic Ada (LSF) Environment Variables
Variable            Usage                  Description
Job ID              $LSB_JOBID             Batch job ID assigned by LSF.
Job Name            $LSB_JOBNAME           The name of the job.
Queue               $LSB_QUEUE             The name of the queue the job is dispatched from.
Error File          $LSB_ERRORFILE         Name of the error file specified with bsub -e.
Submit Directory    $LSB_SUBCWD            The directory the job was submitted from.
Hosts I             $LSB_HOSTS             The list of nodes used to run the batch job, repeated according to the ptile value. (The character limit of the LSB_HOSTS variable is 4096.)
Hosts II            $LSB_MCPU_HOSTS        The list of nodes and the specified or default ptile value per node used to run the batch job.
Host File           $LSB_DJOB_HOSTFILE     The host file containing the list of nodes that are used to run the batch job.

Note: To see all relevant LSF environment variables for a job, add the following line to the executable section of a job file and submit that job. All the variables will be printed in the output file.

env | grep LSB
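
These variables can also be used directly in the executable section of a job file. A small illustrative sketch (the per-job directory layout is just an example, not a requirement):

echo "Job $LSB_JOBID ($LSB_JOBNAME) was submitted from $LSB_SUBCWD"
echo "Dispatched from queue $LSB_QUEUE to the nodes listed in the host file:"
cat $LSB_DJOB_HOSTFILE              # The host file listing the allocated nodes

mkdir -p $SCRATCH/job_$LSB_JOBID    # Create a per-job work directory named after the job ID
cd $SCRATCH/job_$LSB_JOBID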

Clarification on Memory, Core, and Node Specifications

Memory Specifications are IMPORTANT.
For examples of calculating memory, core, and/or node specifications on Ada, see Specification Clarification.
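
As a quick illustration of how these specifications interact, consider the following hypothetical request, with the arithmetic worked out in the comments:

#BSUB -n 20                  # 20 cores in total
#BSUB -R "span[ptile=10]"    # 10 cores per node   =>  20 / 10 = 2 nodes
#BSUB -R "rusage[mem=2560]"  # 2560 MB per core    =>  10 * 2560 MB = 25600 MB (25 GB) needed per node
#BSUB -M 2560                # Per-process (per-core) memory limit of 2560 MB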

Executable Commands

After the resource specification section of a job file comes the executable section. This executable section contains all the necessary UNIX, Linux, and program commands that will be run in the job.
Some commands that may go in this section include, but are not limited to:

  • Changing directories
  • Loading, unloading, and listing modules
  • Launching software

An example of a possible executable section is below:

cd $SCRATCH      # Change current directory to /scratch/user/[netID]/
ml purge         # Purge all modules
ml intel/2016b   # Load the intel/2016b module
ml               # List all currently loaded modules

./myProgram.o    # Run "myProgram.o"

For information on the module system or specific software, visit our Modules page and our Software page.

Job Submission

Once you have your job file ready, it is time to submit your job. You can submit your job to LSF with the following command:

[ NetID@ada ~]$ bsub < MyJob.LSF
Verifying job submission parameters...
Verifying project account...
     Account to charge:   123456789123
         Balance (SUs):      5000.0000
         SUs to charge:         5.0000
Job <12345> is submitted to default queue <sn_regular>.
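
After submission, you can monitor the job with the standard LSF commands; the job ID below is simply the one from the example above:

[ NetID@ada ~]$ bjobs 12345          # Show the status (PEND, RUN, etc.) of job 12345
[ NetID@ada ~]$ bjobs                # Show all of your current jobs
[ NetID@ada ~]$ bkill 12345          # Cancel job 12345 if it is no longer needed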

tamubatch

tamubatch is an automatic batch job script that submits jobs on the Ada and Terra clusters without the user needing to write a batch script. The user just needs to provide the executable commands in a text file, and tamubatch will automatically submit the job to the cluster. There are flags that the user may specify which allow control over the parameters of the submitted job.

tamubatch is still in beta and has not been fully developed. Although there are still bugs and testing issues that are currently being worked on, tamubatch can already submit jobs to both the Ada and Terra clusters if given a file of executable commands.
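
As a rough sketch of the workflow (the file name and its contents are placeholders; see the tamubatch page linked below for the exact flags and current syntax):

# run_commands.in contains the executable commands, one per line, e.g.
#   ml intel/2016b
#   ./myProgram.o

tamubatch run_commands.in    # Submits a job that runs the commands in run_commands.in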

For more information, visit this page.

tamulauncher

tamulauncher provides a convenient way to run a large number of serial or multithreaded commands without the need to submit individual jobs or a job array. The user provides a text file containing all commands that need to be executed, and tamulauncher will execute the commands concurrently. The number of concurrently executed commands depends on the batch requirements. When tamulauncher is run interactively, the number of concurrently executed commands is limited to at most 8. tamulauncher is available on terra, ada, and curie. There is no need to load any module before using tamulauncher. tamulauncher has been successfully tested to execute over 100K commands.

tamulauncher is preferred over Job Arrays to submit a large number of individual jobs, especially when the run times of the commands are relatively short. It allows for better utilization of the nodes, puts less burden on the batch scheduler, and lessens interference with jobs of other users on the same node.
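
A typical pattern (the file name commands.in and the commands it contains are placeholders) is to list one command per line in a text file and call tamulauncher on that file from inside an ordinary batch job:

#BSUB -J tamulauncher_example
#BSUB -L /bin/bash
#BSUB -W 2:00
#BSUB -n 20
#BSUB -R "span[ptile=20]"
#BSUB -M 2560
#BSUB -o tamulauncher_example.%J

# commands.in contains one serial (or multithreaded) command per line, e.g.
#   ./myProgram.o input001
#   ./myProgram.o input002

tamulauncher commands.in     # Executes the commands concurrently on the allocated cores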

For more information, visit this page.

More Ada Examples

The following five job scripts, with their corresponding program sources, illustrate common varieties of computation: serial, OpenMP threads, MPI, MPI-OpenMP hybrid, and MPMD (Multiple-Program-Multiple-Data). Observe the relationship of the different resource (-R) options and settings, and especially note the effect of the ptile setting. We use the old standby helloWorld program/codelet, each time in the guise of the appropriate programming model, because its simplicity allows us to focus on the interaction between the batch parameters and those of the programming models.

Example Job 1 (Serial)

The following job will be run on a single core of any available node, barring those that have 1 TB or 2 TB of memory. The code(let) illustrates one way of capturing, inside a program, the value of an environment variable (here, LSB_HOSTS).

#BSUB -J serial_helloWorld 
#BSUB -L /bin/bash 
#BSUB -W 20
#BSUB -n 1 
#BSUB -R 'rusage[mem=150] span[ptile=1]' 
#BSUB -M 150
#BSUB -o serial_helloWorld.%J 

# Set up the environment
ml purge           # module abbreviated as ml
ml intel/2015B     # 2015B is the richest version on Ada, module load abbreviated as ml
ml                 # module list abbreviated as ml 

# Compile and run serial_helloWorld.exe
ifort -o serial_helloWorld.exe serial_helloWorld.f90
./serial_helloWorld.exe

Source code serial_helloWorld.f90

Program Serial_Hello_World
! By SC TAMU staff:  "upgraded" from 5 years ago for Ada
! ifort -o serial_helloWorld.exe serial_helloWorld.f90
! ./serial_helloWorld.exe
!------------------------------------------------------------------
character (len=20)  ::  host_name='LSB_HOSTS', host_name_val
integer   (KIND=4)  :: sz, status
!
call get_environment_variable (host_name, host_name_val, sz, status, .true.)
!
print *,'- Hello World: node ', trim(adjustl(host_name_val)),' - '
!
end program Serial_Hello_World

Example Job 2 (OpenMP)

This job will run 20 OpenMP threads (OMP_NUM_THREADS=20) on 20 (-n 20) cores, all on the same node (ptile=20).

#BSUB -J omp_helloWorld 
#BSUB -L /bin/bash 
#BSUB -W 20
#BSUB -n 20 
#BSUB -R 'rusage[mem=300] span[ptile=20]' 
#BSUB -M 300
#BSUB -o omp_helloWorld.%J 

# Set up environment
ml purge
ml intel/2015B
ml 

# Compile and run omp_helloWorld.exe
ifort -openmp -o omp_helloWorld.exe omp_helloWorld.f90
export OMP_NUM_THREADS=20          # Set number of OpenMP threads to 20
./omp_helloWorld.exe               # Run the program

Source code omp_helloWorld.f90

Program Hello_World_omp
! By SC TAMU staff:  "upgraded" from 5 years ago for Ada
! ifort -openmp -o omp_helloWorld.exe omp_helloWorld.f90
! ./omp_helloWorld.exe 
!---------------------------------------------------------------------
USE OMP_LIB
character (len=20)  :: host_name='LSB_HOSTS', host_name_val
integer   (KIND=4)  :: sz, status
!
character (len=4)   :: omp_id_str, omp_np_str
integer   (KIND=4)  :: omp_id, omp_np
!
call get_environment_variable (host_name, host_name_val, sz, status, .true.)
!
!$OMP PARALLEL PRIVATE(omp_id, omp_np, omp_id_str, omp_np_str)
!
omp_id = OMP_GET_THREAD_NUM(); omp_np = OMP_GET_NUM_THREADS()
!
! Internal writes convert binary integers to numeric strings so that output
! from print is more tidy.
write (omp_id_str, '(I4)') omp_id
write (omp_np_str, '(I4)') omp_np
!
print *,'- Hello World: node ', trim(adjustl(host_name_val)),' THREAD_ID ', &
trim(adjustl(omp_id_str)), ' out of ',trim(adjustl(omp_np_str)),' OMP threads -'
!
!$OMP END PARALLEL
!
end program Hello_World_omp

Example Job 3 (MPI)

Here the job runs an MPI program on 12 cores/job slots (-n 12), across three different nodes (ptile=4). Note that in this case the -np 12 setting on the MPI launcher command, mpiexec.hydra, must match the number of job slots. mpiexec.hydra accepts -n and -np as the same thing. We opted to use the -np alias in order to avoid confusion with the -n of the BSUB directive.

#BSUB -J mpi_helloWorld 
#BSUB -L /bin/bash 
#BSUB -W 20
#BSUB -n 12 
#BSUB -R 'rusage[mem=150] span[ptile=4]' 
#BSUB -M 150
#BSUB -o mpi_helloWorld.%J 

# Set up environment
ml purge
ml intel/2015B
ml 

# Compile and run mpi_helloWorld.exe
mpiifort -o mpi_helloWorld.exe mpi_helloWorld.f90
mpiexec.hydra -np 12 ./mpi_helloWorld.exe  # Run the program; the -np setting must match the number of job slots

Source code mpi_helloWorld.f90

Program Hello_World_mpi
! By SC TAMU staff:  "upgraded" from 5 years ago for Ada
! mpiifort -o mpi_helloWorld.exe mpi_helloWorld.f90
! mpiexec.hydra -n 2 ./mpi_helloWorld.exe 
!----------------------------------------------------------------
USE MPI
character (len=MPI_MAX_PROCESSOR_NAME) host_name
character (len=4)   :: myid_str
integer   (KIND=4)  :: np, myid, host_name_len, ierr
!
call MPI_INIT(ierr)
if (ierr /= MPI_SUCCESS) STOP '-- MPI_INIT ERROR --'
!
call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
call MPI_COMM_SIZE(MPI_COMM_WORLD, np, ierr)
!
call MPI_GET_PROCESSOR_NAME(host_name, host_name_len, ierr) ! Returns node/host name
!
! Internal write to convert binary integer (myid) to numeric string so that print line is tidy.
write (myid_str, '(I4)') myid
!
print *,'- Hello World: node ', trim(adjustl(host_name)),' MPI process # ', myid_str, ' -'
!
call MPI_FINALIZE(ierr)
!
end program Hello_World_mpi

Example Job 4 (MPI-OpenMP Hybrid)

This job runs an MPI-OpenMP program on 8 job slots with 4 of them allocated per node. That is, the job will run on 2 nodes, one MPI process per node. The latter is accomplished via the -np 2 -perhost 1 settings on the mpiexec.hydra command. It is a quirk of the Intel MPI launcher, mpiexec.hydra, that in order to enforce the -perhost 1 requirement one must also set I_MPI_JOB_RESPECT_PROCESS_PLACEMENT=0. Finally, because of the export OMP_NUM_THREADS=4, each MPI process spawns 4 OpenMP threads. Note that 8 job slots = 4 OMP threads on the 1st node + 4 OMP threads on the 2nd node.

#BSUB -J mpi_omp_helloWorld 
#BSUB -L /bin/bash 
#BSUB -W 20
#BSUB -n 8 
#BSUB -R 'rusage[mem=150] span[ptile=4]' 
#BSUB -M 150
#BSUB -o mpi_omp_helloWorld.%J 

# Set up environment
ml purge
ml intel/2015B
ml 

# Compile and run mpi_omp_helloWorld.exe
mpiifort -openmp -o mpi_omp_helloWorld.exe mpi_omp_helloWorld.f90
export OMP_NUM_THREADS=4                                   # Set number of OpenMP threads to 4
export I_MPI_JOB_RESPECT_PROCESS_PLACEMENT=0               # Needed to respect perhost request
mpiexec.hydra -np 2 -perhost 1 ./mpi_omp_helloWorld.exe    # Run the program

Source code mpi_omp_helloWorld.f90

Program Hello_World_mpi_omp
! By SC TAMU staff:  "upgraded" from 5 years ago for Ada
! mpiifort -openmp -o mpi_omp_helloWorld.exe mpi_omp_helloWorld.f90
! mpiexec.hydra -n 2 ./mpi_omp_helloWorld.exe 
!-----------------------------------------------------------------
USE MPI 
USE OMP_LIB
character (len=MPI_MAX_PROCESSOR_NAME) host_name
character (len=4)   :: omp_id_str, omp_np_str, myid_str
integer   (KIND=4)  :: np, myid, host_name_len, ierr, omp_id, omp_np
!
call MPI_INIT(ierr)
if (ierr /= MPI_SUCCESS) STOP '-- MPI_INIT ERROR --'
!
call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
call MPI_COMM_SIZE(MPI_COMM_WORLD, np, ierr)
!
call MPI_GET_PROCESSOR_NAME(host_name, host_name_len, ierr) ! Returns node/host name
!
!$OMP PARALLEL PRIVATE(omp_id, omp_np, myid_str, omp_id_str, omp_np_str)
!
omp_id = OMP_GET_THREAD_NUM(); omp_np = OMP_GET_NUM_THREADS()
!
! Binary integer to string internal converts so that "print" line is more tidy.
write (myid_str, '(I4)') myid; write(omp_id_str, '(I4)') omp_id
write (omp_np_str, '(I4)') omp_np
!
print *,'- Hello World: node ', trim(adjustl(host_name)),' MPI process # ', &
trim(adjustl(myid_str)),' THREAD_ID ', trim(adjustl(omp_id_str)), &
' of ',trim(adjustl(omp_np_str)),' OMP threads -'
!
!$OMP END PARALLEL
!
call MPI_FINALIZE(ierr)
!
end program Hello_World_mpi_omp 

Example Job 5 (MPMD)

In MPMD and hybrid MPI-OpenMP jobs, you should take care that the job slot and ptile BSUB settings are consistent with the relevant parameters specified on the mpiexec.hydra command, or on whatever other launcher you happen to use.

We carry out two MPMD runs here. In both, note how one passes (locally) different environment variables to different executables. Observe also that in both mpiexec.hydra runs the total number of execution threads is 60 (= the number of job slots):

  • 1st run: (2 MPI processes * 20 OpenMP threads per MPI process) + (2 MPI processes * 10 OpenMP threads per MPI process) = 60
  • 2nd run: (40 MPI processes) + (2 MPI processes * 10 OpenMP threads per MPI process) = 60

Note, however, that against our expectation, in the 2nd run the 2 MPI processes (10 threads each) do not launch on separate nodes but on one, thus leaving a whole node idle. This run uses only 3 nodes. Nonetheless, this example is useful because it illustrates that process placement in a multi-node run can be tricky.

Currently the staff is exploring the use of the LSB_PJL_TASK_GEOMETRY LSF environment variable to flexibly place different MPI processes on different nodes.

#BSUB -J mpmd_helloWorld
#BSUB -L /bin/bash 
#BSUB -W 20
#BSUB -n 60
#BSUB -R "40*{ select[nxt] rusage[mem=150] span[ptile=20]} + 20*{ select[gpu] rusage[mem=150] span[ptile=10] }"
#BSUB -M 150
#BSUB -x
#BSUB -o mpmd_helloWorld.%J

#
# 1st Case: Runs in the MPMD model the same hybrid executable, mpi_omp_helloWorld.exe, twice with different settings.
# The first instance runs 2 MPI processes, 1 per node, at 20 threads per MPI process. This accounts
# for 40 job slots placed on 2 nodes. The second instance also runs 2 MPI processes, 1 per
# node, but with 10 threads per MPI process. The role of the -perhost option is critical here.
# So here we make use of all 4 nodes that the BSUB directives ask for.
#
# 2nd Case: Runs in the MPMD model 1 pure MPI and 1 hybrid executable: mpi_helloWorld.exe & mpi_omp_helloWorld.exe.
# The first executable runs 40 MPI processes; the second runs 2 MPI processes, each one spawning 10 threads.
# All in all this does account for 60 job slots. Unfortunately, 20 job slots are now mapped onto 1 node only,
# not 2. This is mostly due to the fact that we have not been able to place MPI processes on a node by using a
# "local" option. (-perhost ### is a global option)
#

# Set up environment
ml purge
ml intel/2015B
ml  

# Start first case
echo -e "\n\n ***** 1st MPMD Run ****** 1st MPMD Run ****** 1st MPMD Run ******\n\n"
export MP_LABELIO="YES"
export I_MPI_JOB_RESPECT_PROCESS_PLACEMENT=0    # Needed to respect perhost request

# Run the program
mpiexec.hydra -perhost 1 -np 2 -env OMP_NUM_THREADS 20 ./mpi_omp_helloWorld.exe : \
                         -np 2 -env OMP_NUM_THREADS 10 ./mpi_omp_helloWorld.exe

# Start second case
sleep 10; echo -e "\n\n ***** 2nd MPMD Run ****** 2nd MPMD Run ****** 2nd MPMD Run ******\n\n"
export OMP_NUM_THREADS=10

# Run the program
mpiexec.hydra -np 40 ./mpi_helloWorld.exe : -np 2 ./mpi_omp_helloWorld.exe

Queues

LSF, upon job submission, sends your jobs to appropriate batch queues. These are (software) service stations configured to control the scheduling and dispatch of jobs that have arrived in them. Batch queues are characterized by all sorts of parameters. Some of the most important are:

  1. the total number of jobs that can be concurrently running (number of run slots)
  2. the wall-clock time limit per job
  3. the type and number of nodes it can dispatch jobs to
  4. which users or user groups can use that queue; etc.

These settings control whether a job will remain idle in the queue or be dispatched quickly for execution.

The current queue structure (updated on September 27, 2019).

NOTE: Each user is now limited to 8000 cores total for their pending jobs across all the queues.

Single-node (sn_*) queues: for jobs needing only one compute node.
  Cores per job (min/default/max): 1 / 1 / 20. Compute node types: 64 GB nodes (811), 256 GB nodes (26).
  sn_short      walltime (default/max): 10 min / 1 hr
  sn_regular    walltime (default/max): 1 hr / 1 day
  sn_long       walltime (default/max): 24 hr / 4 days
  sn_xlong      walltime (default/max): 4 days / 30 days
  Aggregate limit: maximum of 7000 cores for all running jobs in the single-node (sn_*) queues.
  Per-user limit: maximum of 1000 cores and 100 jobs per user for all running jobs in the single-node (sn_*) queues.

Multi-node (mn_*) queues: for jobs needing more than one compute node.
  Compute node types: 64 GB nodes (811), 256 GB nodes (26).
  mn_short      cores: 2 / 2 / 200       walltime: 10 min / 1 hr    per-queue limit: 2000 cores for all running jobs
  mn_small      cores: 2 / 2 / 120       walltime: 1 hr / 10 days   per-queue limit: 7000 cores for all running jobs
  mn_medium     cores: 121 / 121 / 600   walltime: 1 hr / 7 days    per-queue limit: 6000 cores for all running jobs
  mn_large      cores: 601 / 601 / 2000  walltime: 1 hr / 5 days    per-queue limit: 8000 cores for all running jobs
  Aggregate limit: maximum of 12000 cores for all running jobs in the multi-node (mn_*) queues.
  Per-user limit: maximum of 3000 cores and 150 jobs per user for all running jobs in the multi-node (mn_*) queues.

Other queues:
  xlarge        cores: 1 / 1 / 280   walltime: 1 hr / 10 days   1 TB nodes (11), 2 TB nodes (4)          For jobs needing more than 256 GB of memory per compute node.
  vnc           cores: 1 / 1 / 20    walltime: 1 hr / 6 hr      GPU nodes (30)                           For remote visualization jobs.
  special       cores: None          walltime: 1 hr / 7 days    64 GB nodes (811), 256 GB nodes (26)     Requires permission to access this queue.
  v100 (*)      cores: 1 / 1 / 72    walltime: 1 hr / 2 days    192 GB nodes, dual 32 GB V100 GPUs (2)

  • (*) The V100 nodes were moved to terra in preparation for the decommissioning of Ada.

LSF determines which queue will receive a job for processing. The selection is determined mainly by the resources (e.g., number of CPUs, wall-clock limit) specified, explicitly or by default. There are two exceptions:

  1. The xlarge queue, which is associated with nodes that have 1 TB or 2 TB of main memory. To use it, submit jobs with the -q xlarge option along with -R "select[mem1tb]" or -R "select[mem2tb]", as shown in the sketch below.
  2. The special queue, which gives one access to all of the compute nodes. You MUST request permission to get access to this queue.
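
For instance, the directives for a job that needs a 1 TB node would include (a minimal sketch built directly from option 1 above):

#BSUB -q xlarge              # Send the job to the xlarge queue
#BSUB -R "select[mem1tb]"    # Request a 1 TB node (use select[mem2tb] for a 2 TB node)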

To access any of the above queues, you must use the -q queue_name option in your job script.

Output from the bjobs command contains the name of the queue associated with a given job.

Checkpointing

Checkpointing is the practice of creating a save state of a job so that, if interrupted, it can begin again without starting completely over. This technique is especially important for long jobs on the batch systems, because each batch queue has a maximum walltime limit.


A checkpointed job file is particularly useful for the gpu queue, which is limited to 2 days walltime due to its demand. There are many cases of jobs that require the use of gpus and must run longer than two days, such as training a machine learning algorithm.


Users can modify their code to implement save states so that their jobs can restart automatically when cut off by the walltime limit. There are many different ways to checkpoint a job depending on the software used, but it is almost always done at the application level. How frequently save states are made is up to the user and depends on the fault tolerance needed for the job; in the case of the batch system, however, the exact time of the 'fault' is known: it is simply the walltime limit of the queue. In this case, only one checkpoint needs to be created, right before the limit is reached. Many different resources are available for checkpointing techniques. Some examples for common software are listed below.
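
As a generic illustration of the restart pattern (not tied to any particular software package, and assuming a hypothetical program myProgram that writes and later reads a restart file named checkpoint.dat), the executable section of a checkpointed job file might look like this:

cd $SCRATCH/myRun

if [ -f checkpoint.dat ]; then
    ./myProgram --restart checkpoint.dat   # A save state exists: continue from it (--restart is a hypothetical flag)
else
    ./myProgram                            # First run: start from scratch; myProgram writes checkpoint.dat before the walltime limit
fi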

Advanced Documentation

This guide only covers the most commonly used options and useful commands.

For more information, check the man pages for individual commands, the LSF Manual, or the Ada:Batch Advanced Topics page.
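
For example, the man pages for the most common LSF commands are typically available directly on Ada:

[ NetID@ada ~]$ man bsub     # Job submission options
[ NetID@ada ~]$ man bjobs    # Job status queries
[ NetID@ada ~]$ man bkill    # Cancelling jobs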