
SW:tamulauncher

From TAMU HPRC

tamulauncher

tamulauncher provides a convenient way to run a large number of serial or multithreaded commands without the need to submit individual jobs or a job array. tamulauncher takes as its argument a text file containing all the commands that need to be executed, and it executes those commands concurrently. The number of concurrently executed commands depends on the batch requirements. When tamulauncher is run interactively, the number of concurrently executed commands is limited to at most 8. tamulauncher is available on terra, ada, and curie. There is no need to load any module. tamulauncher has been successfully tested with over 100K commands.

tamulauncher is preferred over Job Arrays to submit a large number of individual jobs, especially when the run times of the commands are relatively short. It allows for better utilization of the nodes, puts less burden on the batch scheduler, and lessens interference with jobs of other users on the same node.

Synopsis

[ NetID@ ~]$ tamulauncher --help
   Usage: /sw/local/bin/tamulauncher [options] FILE

   This script will execute commands in FILE concurrently. 

   OPTIONS:

     --commands-pernode | -p <n> 
            Set the number of concurrent processes per node.

     --norestart
            Do not restart.

     --status <commands file>
            Prints number of finished commands and exits.  

     --list <commands file>
            Prints detailed list of all finished commands and exits.

     --remove-logs <commands file>
            Removes the log directory and exits

     --version | -v
            Prints version and exits.

     --help | -h | ?
            Shows this message and exits.

Commands file

The commands file is a regular text file containing all the commands that need to be executed. Every line contains one command. A command can be a user-compiled program, a Linux command, a script (e.g. bash, Python, Perl), a software package, etc. Commands can also be combined on one line using the shell's semicolon operator. In general, any command that works when typed in a bash shell will work when executed using tamulauncher. Below is an example of a commands file; it illustrates that a commands file can contain any combination of commands (although in practice it is mostly a repetition of the same command with varying input parameters). In many cases a commands file can be generated automatically.


./prog1 125
./prog2 "aa" 3 
mkdir testcase1 ; cd testcase1; ./myprog
./prog1 100
    :
    :
    :
time ./prog3 
python mypython.py
./prog1 141 ; ./prog4 > OUTPUT
./prog5 < myinput
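
Since a commands file is often a repetition of one command with varying parameters, it can be generated with a short loop. The following is a minimal sketch, not an official recipe: the file name commands.in, the parameter range 1 to 100, and the output_N.txt file names are assumptions chosen for illustration, and ./prog1 is the placeholder program from the example above.

```shell
#!/bin/bash
# Sketch: build a commands file with one command per line.
# The file name, parameter range, and output names are illustrative assumptions.

commands_file="commands.in"

# Start from an empty file so rerunning this script does not append duplicates.
: > "$commands_file"

# Write one command per line, one line for each input parameter.
for (( param = 1; param <= 100; param++ )); do
    echo "./prog1 $param > output_$param.txt" >> "$commands_file"
done

echo "Wrote $(wc -l < "$commands_file") commands to $commands_file"
```

Each generated line is an ordinary shell command, so the redirection to a separate output file per command works the same way as in the hand-written example.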

Dynamic release of resources

tamulauncher will automatically release resources whenever they become idle. On ada and curie, resources will be released on a per-core basis. On terra, resources will be released on a per-node basis. This feature is especially useful in cases where the majority of requested cores/nodes are idle, taking up valuable resources, while only a few cores are processing the last few commands. On ada/curie, please add the following LSF line to your batch script:

#BSUB -app resizable

Adding this option will enable LSF to dynamically release resources. Your tamulauncher run will still work fine without the above LSF flag, but LSF will prohibit release of resources and it will write a little warning message to your output/error file.

There are no changes required on terra to enable dynamic release of resources.

NOTE: On terra you might see some SLURM error messages such as "srun: error: <NODE>: task 7: Killed". These messages can be safely ignored.

NOTE: This is an experimental feature that we are still improving. For that reason, as well as some needed changes to the calculation of SUs, the number of SUs charged will not be adjusted at this time. However, it will help to make the cluster less congested.

Examples

The following sections describe two simple examples of how to use tamulauncher. The first example shows how to run serial commands and the second example shows how to run multi-threaded commands.

Example 1: Simple tamulauncher run

#BSUB -L /bin/bash
#BSUB -J demo-tamulauncher
#BSUB -o demo-tamulauncher.%J
#BSUB -W 07:00
#BSUB -n 200
#BSUB -M 150
#BSUB -R 'rusage[mem=150]'
#BSUB -R 'span[ptile=20]'

# special LSF option to release resources
#BSUB -app resizable


tamulauncher  commands.in

In the above example, tamulauncher will distribute the commands among the 200 requested cores: 200 commands will be executed concurrently, and every task will process roughly N/200 commands, where N is the total number of commands in commands.in. The batch script for curie will be exactly the same (except that ptile can be at most 16). On terra the script will use SLURM-style directives.
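
The N/200 estimate above can help when choosing a wall time. The following sketch computes the rounded-up number of commands per core; the function name commands_per_core is hypothetical, and the calculation assumes the commands are spread evenly, which may not match how tamulauncher actually schedules them.

```shell
#!/bin/bash
# Sketch: estimate how many commands each core has to process.
# commands_per_core is a hypothetical helper, not part of tamulauncher.

commands_per_core() {
    local total_commands=$1   # N: number of lines in the commands file
    local cores=$2            # number of cores requested with #BSUB -n
    # Integer ceiling division: (N + cores - 1) / cores
    echo $(( (total_commands + cores - 1) / cores ))
}

commands_per_core 1000 200   # prints 5
commands_per_core 1001 200   # prints 6
```

For a real run, the first argument could be taken from the commands file itself, for example with `wc -l < commands.in`. Multiplying the result by the longest expected run time of a single command gives a rough upper bound for the -W value.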

Example 2: Running multi-threaded commands

LSF (on ada and curie) does not provide an easy way to specify requirements for jobs where every task (command) wants to utilize multiple cores (i.e. hybrid jobs). This might be a problem when tamulauncher needs to execute multi-threaded (e.g. OpenMP) commands. For that reason, tamulauncher provides the --commands-pernode option to explicitly set the number of concurrent commands per node. NOTE: terra uses the SLURM batch scheduler, which provides an easy way to specify requirements for hybrid jobs. Therefore the --commands-pernode option is mostly used on ada and curie.

ada/curie example

#BSUB -L /bin/bash
#BSUB -J demo-tamulauncher
#BSUB -o demo-tamulauncher.%J
#BSUB -W 07:00
#BSUB -n 200
#BSUB -M 100
#BSUB -R 'rusage[mem=100]'
#BSUB -R 'span[ptile=20]'

# special LSF option to release resources
#BSUB -app resizable

export OMP_NUM_THREADS=4
tamulauncher --commands-pernode 5 commands.in

In this example, tamulauncher will execute only 5 commands concurrently per node (even though ptile is set to 20). The environment variable OMP_NUM_THREADS is set to 4, so every command will use 4 cores (threads). The total number of cores used per node is 5*4=20.

NOTE: in this case, another option would be to set ptile=5 and include #BSUB -x to reserve whole nodes.
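
The relation between ptile, OMP_NUM_THREADS, and --commands-pernode in the ada/curie example is a simple division. The sketch below makes it explicit; the function name commands_pernode is hypothetical and only illustrates the arithmetic, it is not something tamulauncher provides.

```shell
#!/bin/bash
# Sketch: derive a value for --commands-pernode from the cores per node
# (ptile) and the number of threads each command uses (OMP_NUM_THREADS).
# commands_pernode is a hypothetical helper, not part of tamulauncher.

commands_pernode() {
    local cores_per_node=$1
    local threads_per_command=$2
    if [ "$threads_per_command" -gt "$cores_per_node" ]; then
        echo "error: more threads per command than cores per node" >&2
        return 1
    fi
    # Integer division rounds down, so the node is never oversubscribed.
    echo $(( cores_per_node / threads_per_command ))
}

commands_pernode 20 4   # prints 5 (ada, ptile=20)
commands_pernode 16 4   # prints 4 (curie, ptile=16)
```

When the thread count does not divide the core count evenly, the rounded-down result leaves some cores on each node unused.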

terra example

#!/bin/bash

#SBATCH --export=NONE               
#SBATCH --get-user-env=L             

##NECESSARY JOB SPECIFICATIONS
#SBATCH --job-name=demo-tamulauncher
#SBATCH --output=demo-tamulauncher.%j
#SBATCH --time=07:00:00
#SBATCH --ntasks=70
#SBATCH --ntasks-per-node=7          
#SBATCH --cpus-per-task=4
#SBATCH --mem=4096M                  
      

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
tamulauncher commands.in

In the above example, tamulauncher will use all the requirements specified in the SLURM script and execute 7 commands per node. The SLURM --cpus-per-task option will make sure 4 cores are reserved for every task, so every command can use up to 4 threads. There is no need to specify the tamulauncher --commands-pernode option in this case.

Automatic Restart

tamulauncher keeps track of all commands that have been executed. When you start a tamulauncher job, it will check for a log (located in directory .tamulauncher-log) from a previous run; if one exists, it will continue executing the commands that did not finish during the previous run. This is especially useful when a tamulauncher job was killed because it ran out of wall time or there was a system problem. To turn off the automatic restart option, use the --norestart flag in your tamulauncher command, e.g.

  tamulauncher --norestart commands.in

With this option, tamulauncher will wipe all the log files and start as if it were a first run.

NOTE: tamulauncher keeps a log for every unique commands file. If you make any changes to the commands file, tamulauncher will assume it's a different commands file and will create a new log directory. This also means multiple tamulauncher runs can be executed in the same directory.

Monitoring runs

To see how many commands have been executed use the --status flag in tamulauncher:

  [ NetID@ ~]$ tamulauncher --status  <command file>

This will show a one-line summary with the number of commands executed and the total number of commands for the tamulauncher run on <command file>.

To see a full listing of all finished commands use the --list flag in tamulauncher:

  [ NetID@ ~]$ tamulauncher --list  <command file>

This will show a list of all commands that have finished executing, including the index in the commands file, total run time, and exit status, for the tamulauncher run on <command file>.


Clearing the log

To clear the log for a particular tamulauncher run, use the --remove-logs flag.

  [ NetID@ ~]$ tamulauncher --remove-logs  <command file>

This will clear the logs for the latest tamulauncher run on commands file <command file>. NOTE: do not clear the logs while tamulauncher is still running on that particular <command file>.