
Ada:Batch Job Submission

Revision as of 16:22, 2 December 2015 by Francis (talk | contribs) (Environment Variables)

Job Submission: the bsub command

bsub < jobfile                  # Submits specified job for processing by LSF

Here is an illustration:

$ bsub < sample1.job
Verifying job submission parameters...
Job <224139> is submitted to default queue <devel>.

The first thing LSF does upon submission is to tag your job with a numeric identifier, a job id. Above, that identifier is 224139. You will need it in order to track or manage (kill or modify) your jobs. Next, note that the default current working directory for the job is the directory you submitted the job from. If that's not what you need, you must change it explicitly, for example by having the job file cd into a specific directory. On job completion, LSF will place in the submission directory the file stdout1.224139. It contains a log of job events and other data directed to standard out. Always inspect this file for useful information.
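
Since the job id is needed by the tracking and control commands, it can be convenient to capture it in a shell variable at submission time. The following is a minimal sketch, assuming the confirmation line always has the form shown in the illustration; the variable names submit_msg and job_id are illustrative only, and the sample line is hard-coded so that the snippet runs without LSF.

```shell
# Sample confirmation line, copied from the illustration above.
# In a live session it would come from bsub itself, for example:
#   submit_msg=$(bsub < sample1.job | grep '^Job <')
submit_msg='Job <224139> is submitted to default queue <devel>.'

# Keep only the digits between the first pair of angle brackets.
job_id=$(printf '%s\n' "$submit_msg" | sed -n 's/^Job <\([0-9][0-9]*\)>.*/\1/p')

echo "$job_id"    # prints 224139
```

The captured value can then be passed to commands such as bjobs or bkill.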

By default, a job executes under the environment of the submitting process. You can change this by using the -L shell option (see below) and/or by specifying at the start of the job script the shell that will execute it. For example, if you want the job to execute under the C-shell, the first line of the job script, above the #BSUB directives, should be #!/bin/csh.

Five important job parameters:

#BSUB -n NNN                    # NNN: total number of cores/jobslots to allocate for the job
#BSUB -R "span[ptile=XX]"       # XX:  number of cores/jobslots per node to use. Also, a node selection criterion
#BSUB -R "select[node-type]"    # node-type: nxt, mem256gb, gpu, phi, mem1t, mem2t ...
#BSUB -R "rusage[mem=nnn]"      # reserves nnn MBs per process/CPU for the job
#BSUB -M mm                     # sets the per process enforceable memory limit to mm MB

We list these together because in many jobs they are closely related and must therefore be set consistently. We recommend their adoption in all jobs: serial, single-node, and multi-node. The rusage[mem=nnn] setting causes LSF to select nodes that can each allocate XX * nnn MB for the execution of the job. The -M mm setting sets and enforces the per-process memory size limit. When this limit is violated, the job will abort. Omitting this specification causes LSF to assume the default memory limit, which by configuration is set to 2.5 gigabytes (2500 MB) per process. The following examples, with some commentary, illustrate the use of these options.

Important: if the process memory limit, default (2500 MB) or specified, is exceeded during execution the job will fail with a memory violation error.

#BSUB -n 900                    # 900: number of cores/jobslots to allocate for the job
#BSUB -R "span[ptile=20]"       # 20:  number of cores per node to use
#BSUB -R "select[nxt]"          # Allocates NeXtScale type nodes

The above specifications will allocate 45 (=900/20) whole nodes. In many parallel jobs the selection of NeXtScale nodes at 20 cores per node is the best choice. Here we are only illustrating what happens when you omit the memory-related options; we strongly urge that you specify them. With those options omitted, the enforceable memory limit per process is the default setting, 2500 MB (2.5 GB).

#BSUB -n 900                    # 900: total number of cores/jobslots to allocate for the job
#BSUB -R "span[ptile=16]"       # 16:  number of cores/jobslots per node to use
#BSUB -R "select[nxt]"          # allocates NeXtScale type nodes
#BSUB -R "rusage[mem=3600]"     # schedules on nodes that have at least 3600 MB per process/CPU avail
#BSUB -M 3600                   # enforces 3600 MB memory use per process 

The above specifications will allocate 57 (= ceiling(900/16)) nodes. The decision to use only XX (here 16) cores per node, and not the maximum of 20, requires some judgement and depends on the execution profile of the job. Typically, some experimentation is required to find the optimal ptile value for a given code.
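
The arithmetic behind the two examples above can be checked with a few lines of shell before submitting. This is only a sketch: the helper names nodes_needed and mem_per_node are hypothetical and are not part of LSF. The first computes ceiling(n/ptile); the second computes the XX * nnn MB that LSF must find on each node.

```shell
# nodes_needed TOTAL_CORES PTILE
# Prints ceiling(TOTAL_CORES / PTILE) using integer arithmetic.
nodes_needed() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

# mem_per_node PTILE MEM_PER_PROCESS_MB
# Prints the memory (in MB) reserved on each node: ptile * rusage[mem].
mem_per_node() {
    echo $(( $1 * $2 ))
}

nodes_needed 900 20     # prints 45
nodes_needed 900 16     # prints 57
mem_per_node 16 3600    # prints 57600
```

If the per-node figure exceeds what the selected node type can offer, the job cannot be placed as specified, so it is worth comparing the two before choosing a ptile value.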

#BSUB -n 1                    # Allocate a total of 1 cpu/core for the job, appropriate for serial processing.
#BSUB -R "span[ptile=1]"      # Allocate 1 core per node.
#BSUB -R "select[gpu]"        # Allocate a node that has gpus (of 64GB or 256GB memory). A "select[phi]"
                              # specification would allocate a node with phi coprocessors.

Omitting the last two options in the above will cause LSF to place the job on any conveniently available core on any node, idle or (partially) busy, of any type, except on those with 1TB or 2TB memory.

It is worth emphasizing that, under the current LSF setup, only the -x option and a ptile value equal to the node's core limit will prevent LSF from scheduling other jobs on the balance of unreserved cores of a node that your job occupies.

Inhomogeneous Node Selection

#BSUB -n 900
#BSUB -R "600*{ select[nxt] rusage[mem=3000] span[ptile=20]} + 300*{ select[gpu] rusage[mem=3000] span[ptile=20] }"
#BSUB -M 3000

The above specification will allocate 30 (600/20) NeXtScale and 15 (300/20) iDataPlex nodes, the latter with GPUs, at 20 cores per node. Note that the enforceable memory limit here is 3000 MB per process. In the Examples section, we provide an illustration of the usefulness of inhomogeneous node selection when the MPMD parallelization model is to be used.

Common BSUB Options

-J job name           - sets the job name.
-q queue              - submits job to the specified queue. Currently (June 2015), this specification is needed only
                        for the following queues: xlarge, special, staff. The first is open to all users and directs jobs exclusively
                        to the 1TB or 2TB main memory nodes. Access to the special and staff queues is restricted.
-L shell              - uses the specified Unix shell to initialize the job's execution environment. Setting
                        this option is required for the module system to work correctly. We recommend that the setting
                        be /bin/bash. Some application packages set up their own shell. If you encounter a problem, notify
                        the help desk.
-W hh:mm or mm        - sets the job's runtime wall-clock limit in hours:minutes or just minutes (mm).
-M mem_limit          - sets the per-process memory limit in megabytes (MB). The job's memory limit is then
                        num_cores * mem_limit MB. When this limit is violated, the job aborts.
-R "rusage[mem=memsz]" - schedules job on nodes that have at least memsz MBs available per process/CPU/core
-n num_cores          - assigns number of job slots/cores.
-x                    - assigns a whole node (same node as above) exclusively for the job. The SUs charged reflect use of
                        all the cores in a node.
-o filename           - directs the job's standard output to filename. The special string %J inserts the job id.
-P project_ID         - an integer that uniquely identifies the project (in the user accounts database) against which
                        the used service units (SUs) are charged. When not specified, the default value is that of the first
                        project defined in the user accounts database.
-u e-mail_addr        - sends email to the specified address (e.g., netid@tamu.edu, myname@gmail.com) with
                        information about main job events.
-B                    - sends email when job starts/begins
-N                    - sends email when job ends
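
To show how these options fit together, here is a minimal job file sketch for a single whole node. The job name, the resource values, and the placeholder program at the end are all hypothetical; only options listed above are used. The fallback values for the LSF variables are an assumption added so that the script can also be run outside LSF as a dry run.

```shell
#!/bin/bash
#BSUB -J sample1                # job name (hypothetical)
#BSUB -L /bin/bash              # start the job in a new login shell
#BSUB -W 2:00                   # wall-clock limit of 2 hours
#BSUB -n 20                     # 20 cores in total
#BSUB -R "span[ptile=20]"       # 20 cores per node, i.e. one whole node
#BSUB -R "rusage[mem=2500]"     # reserve 2500 MB per process
#BSUB -M 2500                   # enforce 2500 MB per process
#BSUB -o stdout1.%J             # standard output file; %J becomes the job id

# LSF defines these variables at run time. The fallbacks are assumptions
# that only let the script run outside LSF for a dry run.
job_id="${LSB_JOBID:-dry-run}"
submit_dir="${LS_SUBCWD:-$PWD}"

# Work in the submission directory explicitly (this is also the default).
cd "$submit_dir" || exit 1

job_banner="Job $job_id started in $submit_dir"
echo "$job_banner"

# The actual computation would go here (hypothetical placeholder):
# ./my_program input.dat
```

Saved as sample1.job, such a file would be submitted with bsub < sample1.job. Note that rusage[mem] and -M are set to the same value, and that ptile * mem (20 * 2500 MB) is what each node must be able to provide.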

Environment Variables

When LSF selects and activates a node to run your job, by default it duplicates the environment the job was submitted from. In the course of your work you may have altered that environment (e.g., by loading modules, or by setting new or changing standard environment variables), so that it differs from the one the login created. The next job you submit, however, may require a different execution environment. Hence we recommend that, when submitting jobs, you request the creation of a new login shell and explicitly customize the environment within the job as needed. A new login shell per job is initialized by specifying the #BSUB -L /bin/bash option.

All the nodes enlisted for the execution of a job carry most of the environment variables the login process created: HOME, PWD, PATH, USER, etc. In addition, LSF defines new ones in the environment of an executing job. Below, we show an abbreviated list.

LSB_QUEUE       - The name of the queue the job is dispatched from.
LSB_JOBNAME     - Name of the job.
LSB_JOBID       - Batch job ID assigned by LSF.
LSB_ERRORFILE   - Name of the error file specified with a bsub -e.
LSB_HOSTS       - The list of nodes (their LSF symbolic names) that are used to run the batch job.
                  A node name is repeated as many times as needed to equal the specified ptile value. 
                  The memory size of LSB_HOSTS variable is limited to 4096 bytes.
LSB_MCPU_HOSTS  - The list of nodes (their LSF symbolic names), each followed by the specified or default ptile value
                  per node, used to run the batch job. This can be relied upon to contain the names of all
                  the deployed hosts.
LS_SUBCWD       - This is the directory the job was submitted from.
TMPDIR          - Set to /work/jobid.tmpdir. LSF and some application programs use it for temporary files.

All of the above are accessible both in a batch script, at the shell level, as well as within a program (see the EXAMPLES subsection).
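
As an illustration of using these variables at the shell level, the sketch below walks through LSB_MCPU_HOSTS, which holds alternating node names and slot counts, and prints a short summary. The function name summarize_hosts is hypothetical, and the fallback value (two node names borrowed from the lnu example on this page) is an assumption that only lets the snippet run outside a job.

```shell
# summarize_hosts "host1 count1 host2 count2 ..."
# Prints one line per node, then the node and slot totals.
summarize_hosts() {
    # Intentional word splitting: the value is a flat, space-separated list.
    set -- $1
    nodes=0
    slots=0
    while [ "$#" -ge 2 ]; do
        echo "$1: $2 slots"
        nodes=$((nodes + 1))
        slots=$((slots + $2))
        shift 2
    done
    echo "$nodes nodes, $slots slots"
}

# Inside a job LSF sets LSB_MCPU_HOSTS; the fallback is sample data only.
summarize_hosts "${LSB_MCPU_HOSTS:-nxt1417 20 nxt1764 20}"
```

LSB_MCPU_HOSTS is used here rather than LSB_HOSTS because the latter is subject to the 4096-byte size limit noted above.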

Job tracking and control commands

bjobs [-u all or user_name] [[-l] job_id]    # displays job information per user(s) or job_id, in summary or detail (-l) form, respectively.
bpeek [-f] job_id                            # displays the current contents of stdout and stderr output of an executing job.
bkill job_id                                 # kills, suspends, or resumes unfinished jobs. See man bkill for details.
bmod [bsub_options]   job_id                 # Modifies job submission options of a job. See man bmod for details.
lsload [node_name]                           # Lists on std out a node's utilization. Use bjobs -l jobid
                                             # to get the names of nodes associated with a jobid. See man lsload for details.

All of the above have decent man pages, if you're interested in more detail.


$ bjobs -u all
JOBID      STAT  USER             QUEUE      JOB_NAME             NEXEC_HOST SLOTS RUN_TIME        TIME_LEFT
223537     RUN   adinar           long       NOR_Q                1          20    400404 second(s) 8:46 L
223547     RUN   adinar           long       NOR_Q                1          20    399830 second(s) 8:56 L
223182     RUN   tengxj1025       long       pro_at16_lowc        10         280   325922 second(s) 5:27 L
229307     RUN   natalieg         long       LES_MORE             3          900   225972 second(s) 25:13 L
229309     RUN   tengxj1025       long       pro_atat_lowc        7          280   223276 second(s) 33:58 L
229310     RUN   tengxj1025       long       cg16_lowc            5          280   223228 second(s) 33:59 L
. . .             . . .     . . .
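
A listing like the one above can be summarized with standard tools. The sketch below totals the slots of running jobs per user; the function name slots_per_user is hypothetical, and the field positions (status in column 2, user in column 3, slots in column 7) are an assumption based on the sample output shown here, which also presumes job names without spaces.

```shell
# slots_per_user: reads a bjobs-style listing on stdin and prints
# "user total_slots" for jobs in RUN state, sorted by user name.
slots_per_user() {
    awk '$2 == "RUN" { slots[$3] += $7 }
         END { for (u in slots) print u, slots[u] }' | sort
}

# Demonstration on three rows copied from the sample listing above.
# In a live session: bjobs -u all | slots_per_user
slots_per_user <<'EOF'
JOBID      STAT  USER             QUEUE      JOB_NAME             NEXEC_HOST SLOTS RUN_TIME        TIME_LEFT
223537     RUN   adinar           long       NOR_Q                1          20    400404 second(s) 8:46 L
223547     RUN   adinar           long       NOR_Q                1          20    399830 second(s) 8:56 L
229307     RUN   natalieg         long       LES_MORE             3          900   225972 second(s) 25:13 L
EOF
```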

$ bjobs -l 229309

Job <229309>, Job Name <pro_atat_lowc>, User <tengxj1025>, Project <default>, M
                          ail <czjnbb@gmail.com>, Status <RUN>, Queue <long>, J
                          ob Priority <250000>, Command <## job name;#BSUB -J p
                          ro_atat_lowc; ## send stderr and stdout to the same f
                          ile ;#BSUB -o info.%J; ## login shell to avoid copyin
                          g env from login session;## also helps the module fun
                          ction work in batch jobs;#BSUB -L /bin/bash; ## 30 mi
                          nutes of walltime ([HH:]MM);#BSUB -W 96:00; ## numpro
                          cs;#BSUB -n 280; . . .
                          . . .

 5760.0 min of nxt1449
Tue Nov  4 21:34:43 2014: Started on 280 Hosts/Processors <nxt1449> <nxt1449> <
                          nxt1449> <nxt1449> <nxt1449> <nxt1449>  ...
                          . . .

                          CWD </scratch/user/tengxj1025/EXTD/pro_atat/lowc/md>;
Fri Nov  7 12:05:55 2014: Resource usage collected.
                          The CPU time used is 67536997 seconds.
                          MEM: 44.4 Gbytes;  SWAP: 0 Mbytes;  NTHREAD: 862

                          HOST: nxt1449
                          MEM: 3.2 Gbytes;  SWAP: 0 Mbytes; CPU_TIME: 9004415 s
                          econds . . .
                          . . .
                          . . .

$ bmod -W 46:00 229309            # resets wall-clock time to 46 hrs for job 229309
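
The bjobs -l output above reports the run limit in minutes (5760.0 min for the -W 96:00 request), so it can help to convert between the two forms when checking or resetting a limit. A minimal sketch follows; the helper name wallclock_minutes is hypothetical.

```shell
# wallclock_minutes HH:MM | MM
# Prints the wall-clock limit expressed in minutes.
wallclock_minutes() {
    echo "$1" | awk -F: '{ if (NF == 2) print $1 * 60 + $2; else print $1 + 0 }'
}

wallclock_minutes 96:00    # prints 5760
wallclock_minutes 46:00    # prints 2760
wallclock_minutes 30       # prints 30
```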

Node Utilization. It may happen that a job uses its allocated nodes inefficiently. Sometimes this is unavoidable, but many times it is very avoidable. It is unavoidable, for instance, if the amount of memory used per node is a large fraction of the total for that node, and only 1 cpu is used. In that case, cpu utilization will be at best 5% (1/20) on a regular node. A handy tool for tracking node utilization, more practical than lsload, is the homegrown lnu command.

lnu [-h] [-l] -j jobid          # lists on stdout the utilization across all nodes for an executing job. See examples below.


$ lnu -l -j 795375
Job          User                 Queue        Status Node  Cpus
795375       jomber23             medium            R    4    80   
        HOST_NAME       status  r15s   r1m  r15m   ut    pg  ls    it   tmp   swp   mem    Assigned Cores
        nxt1417             ok  20.0  21.0  21.0  97%   0.0   0 94976  366M  3.7G 41.6G    20
        nxt1764 (L)         ok  19.7  20.0  20.0  95%   0.0   0 95040  366M  3.7G 41.5G    20
        nxt2111             ok  20.0  20.0  20.0  98%   0.0   0 91712  370M  4.2G 41.5G    20
        nxt2112             ok  20.0  21.1  21.0  97%   0.0   0 91712  370M  4.2G 41.6G    20

$ lnu -l -j 753454
Job          User                 Queue        Status Node  Cpus
753454       ajochoa              long              R    1    20   
        HOST_NAME       status  r15s   r1m  r15m   ut    pg  ls    it   tmp   swp   mem    Assigned Cores
        nxt1222 (L)         ok   4.3   4.5   6.2  20%   0.0   0 54464  422M  4.7G 52.9G    20

The utilization (ut) and memory paging (pg) columns are, overall, probably the most significant. Note that tmp, swp, and mem refer to the amounts currently available.
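
For jobs that span many nodes, the ut column can be scanned automatically. The sketch below prints the nodes whose utilization falls below a threshold. The function name low_util_nodes and the default threshold of 50 are hypothetical choices; the parsing assumes the lnu -l layout shown above, and it locates ut as the first field ending in a percent sign, because the (L) marker shifts the column positions on some rows.

```shell
# low_util_nodes [THRESHOLD]
# Reads lnu -l output on stdin and prints "node ut" for every node whose
# utilization is below THRESHOLD percent (default 50).
low_util_nodes() {
    awk -v limit="${1:-50}" '
        {
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^[0-9]+%$/) {
                    ut = $i
                    sub(/%/, "", ut)
                    if (ut + 0 < limit + 0) print $1, $i
                    break
                }
            }
        }'
}

# Demonstration on two rows copied from the examples above.
# In a live session: lnu -l -j 753454 | low_util_nodes 50
low_util_nodes 50 <<'EOF'
        nxt1417             ok  20.0  21.0  21.0  97%   0.0   0 94976  366M  3.7G 41.6G    20
        nxt1222 (L)         ok   4.3   4.5   6.2  20%   0.0   0 54464  422M  4.7G 52.9G    20
EOF
```

Header and job summary lines contain no percentage field and are therefore skipped.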