
Ada:Batch Job Submission

From TAMU HPRC
Revision as of 11:07, 24 November 2014 by Yangliu (talk | contribs) (Job Submission: the bsub command)

Job Submission: the bsub command

bsub < jobfile                  # Submits specified job for processing by LSF

Here is an illustration,

[userx@login4]$ bsub < sample1.job
Verifying job submission parameters...
Job <224139> is submitted to default queue <devel>.
[userx@login4]$

The first thing LSF does upon submission is to tag your job with a numeric identifier, the job id. Above, that identifier is 224139. You will need it to track or manage (kill or modify) your job. Next, note that the default working directory for the job is the directory you submitted it from. If that is not what you need, you must say so explicitly, for example by changing (cd) into the desired directory within the job file. On job completion, LSF places the file stdout1.224139 in the submission directory. It contains a log of job events and other data directed to standard output. Always inspect this file for useful information.
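Scripts that submit jobs often need the job id for later tracking or control commands. The sketch below pulls it out of the confirmation line. It assumes the message format shown in the illustration above, the variable names are illustrative only, and the sample line is hard-coded so that the snippet runs without LSF.

```shell
#!/bin/bash
# Confirmation line copied from the illustration above. On a login node it
# would instead be captured with: submit_output=$(bsub < sample1.job)
submit_output='Job <224139> is submitted to default queue <devel>.'

# Keep only the digits between the first pair of angle brackets.
# Lines that do not match (such as "Verifying ...") print nothing.
job_id=$(printf '%s\n' "$submit_output" |
         sed -n 's/^Job <\([0-9][0-9]*\)> is submitted.*/\1/p')

echo "job id: $job_id"
```

The captured value can then be passed to commands that take a job id as an argument.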

Three important job parameters:

#BSUB -n NNN                    # NNN: total number of cpus to allocate for the job
#BSUB -R "span[ptile=NN]"       # NN:  number of cores/cpus per node to use
#BSUB -R "select[node-type]"    # node-type: nxt, mem256gb, gpu, phi, mem1t, mem2t ...

We list these together because in many jobs they can be closely related and, therefore, must be consistently set. We recommend their adoption in all jobs, serial, single-node and multi-node. The following examples, with some commentary, illustrate their use.

#BSUB -n 900                    # 900: number of cpus to allocate for the job
#BSUB -R "span[ptile=20]"       # 20:  number of cores/cpus per node to use
#BSUB -R "select[nxt]"          # Allocates NeXtScale nodes

The above specifications will allocate 45 (=900/20) whole nodes. In many parallel jobs the selection of NeXtScale nodes at 20 cores per node is the best choice.

#BSUB -n 900                    # 900: total number of cpus to allocate for the job
#BSUB -R "span[ptile=16]"       # 16:  number of cores/cpus per node to use
#BSUB -R "select[nxt]" -x       # Allocates exclusively whole NeXtScale nodes

The above specifications will allocate 57 (= ceiling(900/16)) nodes. The exclusive (-x) node allocation requested here may be important for multi-node parallel jobs that need it. It prevents LSF from scheduling other jobs, which might use 4 cores or fewer, on those nodes. Without -x, one or more of the 57 nodes may end up hosting more than one job, which can drastically reduce the performance of the 900-core job. "Wasting" 4 cores per node can be justified by specific program behavior, such as memory use or communication traffic. Even so, the decision to use 16 or fewer cores per node should be made only after careful experimentation. With the -x option you are charged, in SUs, for all 20 cores per node, not 16. So use it sensibly.
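The node counts quoted in the two examples follow from a ceiling division of the -n value by the ptile value. A minimal sketch of that arithmetic in bash is given here. The function name nodes_needed is hypothetical, and the last calculation assumes, per the SU remark above, that an exclusive job is charged for all 20 cores of each NeXtScale node.

```shell
#!/bin/bash
# nodes_needed TOTAL_CORES PTILE
# Prints ceiling(TOTAL_CORES / PTILE) using integer arithmetic only.
nodes_needed() {
    local total=$1 ptile=$2
    echo $(( (total + ptile - 1) / ptile ))
}

full=$(nodes_needed 900 20)      # 45 nodes
sparse=$(nodes_needed 900 16)    # 57 nodes

# Assumption: with -x, every core of each allocated 20-core node is charged.
charged=$(( sparse * 20 ))       # 1140 cores

echo "ptile=20: $full nodes; ptile=16: $sparse nodes, $charged cores charged with -x"
```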


#BSUB -n 1                    # Allocate a total of 1 cpu/core for the job, appropriate for serial processing.
#BSUB -R "span[ptile=1]"      # Allocate 1 cpu per node.
#BSUB -R "select[gpu]"        # Allocate a node that has gpus, with 64GB or 256GB of memory. A "select[phi]"
                              # specification would allocate a node with phi coprocessors.

Omitting the last two options in the above will cause LSF to place the job on any conveniently available core on any node, idle or busy, of any type, except on those with 1TB or 2TB memory.


Common BSUB Options

... pending ...

More Examples

... pending ...

Example 2

#BSUB -J OpenMP1 ...
#BSUB ..
....

Example 3

## A multi-node MPI job 
#BSUB -J mpitest -o mpitest.%J -L /bin/bash -W 30 -n 200 -R 'span[ptile=20]'

module load ictce         # load intel toolchain

## ONLY SET THESE VARIABLES FOR RUNNING INTEL MPI JOBS (WITH MORE THAN 100 CORES)
# tells Intel MPI to launch MPI processes using LSF's blaunch
export I_MPI_HYDRA_BOOTSTRAP=lsf
# tell Intel MPI to launch only one blaunch instance (for scalability and
# stability)
export I_MPI_LSF_USE_COLLECTIVE_LAUNCH=1
# set this variable to the number of hosts, i.e. (-n value) divided by (ptile
# value); here 200/20 = 10
export I_MPI_HYDRA_BRANCH_COUNT=10

# launch MPI program using the hydra launcher
mpiexec.hydra ./hw.mpi.C.exe
....
....

Environment Variables

When LSF selects and activates a node to run your job, it executes a login to that node. The environment of that login process is mostly a duplicate of the environment of the process you launched your job (bsub) from. In general, we recommend that you request a new shell that does not inherit features the launching process may have acquired, say, by loading one or more application modules. These may conflict with, or be irrelevant to, the modules you need to load within the job for its execution. Hence the recommendation to specify the #BSUB -L /bin/bash option in a job file.

All the nodes enlisted for the execution of a job carry most of the environment variables of the login process: HOME, PWD, PATH, USER, etc. In addition, LSF defines new ones. Below, we show an abbreviated list.

LSB_QUEUE:     The name of the queue the job is dispatched from.
LSB_JOBNAME:   Name of the job.
LSB_JOBID:     Batch job ID assigned by LSF.
LSB_ERRORFILE: Name of the error file specified with a bsub -e.
LSB_HOSTS:     The list of nodes (their names) that are used to run the batch job.
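
Inside a running job these variables can be read like any other shell variable. The sketch below prints a few of them and counts the distinct nodes named in LSB_HOSTS, on the assumption that LSB_HOSTS holds space-separated host names with one entry per allocated core. The fallback values, including the hypothetical host nxt0002, exist only so that the snippet also runs outside LSF.

```shell
#!/bin/bash
# Fallbacks are illustrative; under LSF the real values are already set.
job_id=${LSB_JOBID:-224139}
queue=${LSB_QUEUE:-devel}
hosts=${LSB_HOSTS:-"nxt1449 nxt1449 nxt0002"}

# One name per line, duplicates removed, then counted.
# $hosts is left unquoted on purpose so the shell splits it on whitespace.
num_nodes=$(printf '%s\n' $hosts | sort -u | wc -l)
num_nodes=$(( num_nodes ))   # strip padding that some wc implementations add

echo "job $job_id in queue $queue uses $num_nodes node(s)"
```

A count obtained this way could, for instance, be compared against the (-n value) divided by (ptile value) that the job requested.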

Job tracking and control commands

bjobs [-u all or user_name] [[-l] job_id]    # displays job information per user(s) or job_id, in summary or detail (-l) form
bpeek [-f] job_id                            # displays the stdout and stderr output of an unfinished job
bkill job_id                                 # kills, suspends, or resumes unfinished jobs. See man bkill for details
bmod  job_id [bsub_options]                  # Modifies job submission options of a job. See man bmod for details

Example

[userx@login4]$ bjobs -u all
JOBID      STAT  USER             QUEUE      JOB_NAME             NEXEC_HOST SLOTS RUN_TIME        TIME_LEFT
223537     RUN   adinar           long       NOR_Q                1          20    400404 second(s) 8:46 L
223547     RUN   adinar           long       NOR_Q                1          20    399830 second(s) 8:56 L
223182     RUN   tengxj1025       long       pro_at16_lowc        10         280   325922 second(s) 5:27 L
229307     RUN   natalieg         long       LES_MORE             3          900   225972 second(s) 25:13 L
229309     RUN   tengxj1025       long       pro_atat_lowc        7          280   223276 second(s) 33:58 L
229310     RUN   tengxj1025       long       cg16_lowc            5          280   223228 second(s) 33:59 L
. . .             . . .     . . .

[userx@login4]$ bjobs -l 229309

Job <229309>, Job Name <pro_atat_lowc>, User <tengxj1025>, Project <default>, M
                          ail <czjnbb@gmail.com>, Status <RUN>, Queue <long>, J
                          ob Priority <250000>, Command <## job name;#BSUB -J p
                          ro_atat_lowc; ## send stderr and stdout to the same f
                          ile ;#BSUB -o info.%J; ## login shell to avoid copyin
                          g env from login session;## also helps the module fun
                          ction work in batch jobs;#BSUB -L /bin/bash; ## 30 mi
                          nutes of walltime ([HH:]MM);#BSUB -W 96:00; ## numpro
                          cs;#BSUB -n 280; . . .
                          . . .

 RUNLIMIT
 5760.0 min of nxt1449
Tue Nov  4 21:34:43 2014: Started on 280 Hosts/Processors <nxt1449> <nxt1449> <
                          nxt1449> <nxt1449> <nxt1449> <nxt1449>  ...
                          . . .

Execution
                          CWD </scratch/user/tengxj1025/EXTD/pro_atat/lowc/md>;
Fri Nov  7 12:05:55 2014: Resource usage collected.
                          The CPU time used is 67536997 seconds.
                          MEM: 44.4 Gbytes;  SWAP: 0 Mbytes;  NTHREAD: 862

                          HOST: nxt1449
                          MEM: 3.2 Gbytes;  SWAP: 0 Mbytes; CPU_TIME: 9004415 s
                          econds . . .
                          . . .
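
Summary listings such as the bjobs -u all output in the example above are easy to post-process with standard tools. The sketch below totals the SLOTS column per user with awk. It assumes the column layout shown in that listing (USER in column 3, SLOTS in column 7) and job names without spaces, and it reads three sample rows copied from the listing rather than calling bjobs.

```shell
#!/bin/bash
# Header plus three rows copied from the example listing; on a login node
# this text would come from: bjobs -u all
listing='JOBID      STAT  USER             QUEUE      JOB_NAME             NEXEC_HOST SLOTS RUN_TIME        TIME_LEFT
223537     RUN   adinar           long       NOR_Q                1          20    400404 second(s) 8:46 L
223182     RUN   tengxj1025       long       pro_at16_lowc        10         280   325922 second(s) 5:27 L
229309     RUN   tengxj1025       long       pro_atat_lowc        7          280   223276 second(s) 33:58 L'

# Skip the header (NR > 1), add up SLOTS ($7) per USER ($3), sort by user.
totals=$(printf '%s\n' "$listing" |
         awk 'NR > 1 { slots[$3] += $7 } END { for (u in slots) print u, slots[u] }' |
         sort)

echo "$totals"
```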