
Difference between revisions of "Ada:Batch Job Submission"

From TAMU HPRC
m (Common BSUB Options)
(tamubatch)
 
(136 intermediate revisions by 9 users not shown)
Line 1: Line 1:
==Job Submission: the bsub command==
+
== Job Submission ==
<pre>
+
Once you have your job file ready, it is time to submit your job. You can submit your job to LSF with the following command:
bsub < jobfile                  # Submits specified job for processing by LSF
+
[ NetID@ada ~]$ '''bsub < ''MyJob.LSF'''''
</pre>
+
Verifying job submission parameters...
 +
Verifying project account...
 +
      Account to charge:  123456789123
 +
          Balance (SUs):      5000.0000
 +
          SUs to charge:        5.0000
 +
Job <12345> is submitted to default queue <sn_regular>.
  
Here is an illustration,
+
== tamubatch ==
  
<pre>
+
'''tamubatch''' is an automatic batch job script that submits jobs on the Ada and Terra clusters without requiring the user to write a batch script. The user only needs to provide the executable commands in a text file, and tamubatch will automatically submit the job to the cluster. Optional flags give the user control over the parameters of the submitted job.
[userx@login4]$ bsub < sample1.job
 
Verifying job submission parameters...
 
Job <224139> is submitted to default queue <devel>.
 
[userx@login4]$
 
</pre>
 
  
The first thing LSF does upon submission is to tag your job with a numeric identifier, a job id.
+
''tamubatch is still in beta and has not been fully developed. Although there are still bugs and testing issues that are currently being worked on, tamubatch can already submit jobs to both the Ada and Terra clusters if given a file of executable commands. ''
Above, that identifier is '''224139'''. You will need it in order to track or manage (kill or modify)
 
your jobs. Next, note that the default current working directory for the job is the directory
 
you submitted the job from. If that's not what you need, you must explicitly indicate that, as we
 
do above when we cd into a specific directory. On job completion, LSF will place in the submission
 
directory the file stdout1.224139. It contains a log of job events and other data directed to
 
standard out. Always inspect this file for useful information.
 
  
'''Three important job parameters:'''
+
For more information, visit [https://hprc.tamu.edu/wiki/SW:tamubatch this page.]
<pre>
 
#BSUB -n NNN                    # NNN: total number of cpus to allocate for the job
 
#BSUB -R "span[ptile=XX]"      # XX: number of cores/cpus per node to use
 
#BSUB -R "select[node-type]"    # node-type: nxt, mem256gb, gpu, phi, mem1t, mem2t ...
 
</pre>
 
  
We list these together because in many jobs they can be closely related and, therefore, must be
+
== tamulauncher ==
consistently set. We recommend their adoption in all jobs, serial, single-node and multi-node.
 
The following examples, with some commentary,  illustrate their use.
 
  
<pre>
+
'''tamulauncher''' provides a convenient way to run a large number of serial or multithreaded commands without the need to submit individual jobs or a job array. The user provides a text file containing all commands that need to be executed, and tamulauncher executes the commands concurrently. The number of concurrently executed commands depends on the batch requirements. When tamulauncher is run interactively, the number of concurrently executed commands is limited to at most 8. tamulauncher is available on Terra, Ada, and Curie. There is no need to load any module before using tamulauncher. tamulauncher has been successfully tested to execute over 100K commands.
#BSUB -n 900                    # 900: number of cpus to allocate for the job
 
#BSUB -R "span[ptile=20]"      # 20:  number of cores/cpus per node to use
 
#BSUB -R "select[nxt]"          # Allocates NeXtScale nodes
 
</pre>
 
  
The above specifications will allocate 45 (=900/20) whole nodes. In many parallel jobs the selection
+
''tamulauncher is preferred over Job Arrays to submit a large number of individual jobs, especially when the run times of the commands are relatively short. It allows for better utilization of the nodes, puts less burden on the batch scheduler, and lessens interference with jobs of other users on the same node.'' 
of NeXtScale nodes at 20 cores per node is the best choice.
 
  
<pre>
+
For more information, visit [https://hprc.tamu.edu/wiki/SW:tamulauncher#tamulauncher this page.]
#BSUB -n 900                    # 900: total number of cpus to allocate for the job
 
#BSUB -R "span[ptile=16]"      # 16: number of cores/cpus per node to use
 
#BSUB -R "select[nxt]" -x      # Allocates exclusively whole NeXtScale nodes
 
</pre>
 
  
The above specifications will allocate 57 (= ceiling(900/16)) nodes. The exclusive ('''-x''') node allocation
+
[[ Category:Ada ]]
requested here may be important for multi-node parallel jobs that need it. It will prevent the scheduling
 
of other jobs on such nodes, jobs which might use 4 cores or fewer. Without -x, one or

more of the 57 nodes may end up hosting more than one job. This can drastically reduce the performance of the

900-core job. "Wasting" 4 cores per node can be justified by specific

program behavior, such as memory or communication traffic. Still, the decision to go with 16 cores

per node or fewer should be made only after careful experimentation. Applying the -x option will cost

you, in terms of SUs, the same as using all 20 cores per node, not 16. So use it sensibly.
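The node counts quoted above follow from a ceiling division, and the SU remark follows from multiplying the node count by the full core count of a node. The sketch below works through that arithmetic. The function names are illustrative, and the default of 20 cores per node is taken from the NeXtScale figures in the text; it is not a statement of how LSF accounts for usage internally.

```shell
#!/bin/bash
# Nodes needed for a given core count and ptile value:
# ceiling(cores / ptile), done with integer arithmetic.
nodes_needed() {
    local cores=$1 ptile=$2
    echo $(( (cores + ptile - 1) / ptile ))
}

# Cores charged under -x: every core of every allocated node.
# cores_per_node defaults to 20 (assumption taken from the text above).
cores_charged_exclusive() {
    local cores=$1 ptile=$2 cores_per_node=${3:-20}
    echo $(( $(nodes_needed "$cores" "$ptile") * cores_per_node ))
}

nodes_needed 900 20              # 45
nodes_needed 900 16              # 57
cores_charged_exclusive 900 16   # 1140
```

With ptile=16 and -x, the 900-core job is therefore charged as if it used 1140 cores.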
 
 
 
 
 
<pre>
 
#BSUB -n 1                    # Allocate a total of 1 cpu/core for the job, appropriate for serial processing.
 
#BSUB -R "span[ptile=1]"      # Allocate 1 cpu per node.
 
#BSUB -R "select[gpu]"        # Make the allocated node have gpus, of 64GB or 256GB memory. A "select[phi]"
 
                              # specification would allocate a node with phi coprocessors.
 
</pre>
 
 
 
Omitting the last two options in the above will cause LSF to place the job on any conveniently available
 
core on any node, idle or busy, of any type, except on those with 1TB or 2TB memory.<br>
 
 
 
It is worth emphasizing that, under the current LSF setup, only two settings prevent LSF from scheduling other jobs onto a node's unreserved cores:

the '''-x''' option, and a ptile value equal to the node's core limit.
 
 
 
 
 
====Common BSUB Options====
 
<pre>
 
-J job name          - sets the job name.
 
-L /bin/bash          - uses the bash login shell to initialize the job's execution environment.
 
-W hh:mm or mm        - sets the job's runtime wall-clock limit. A bare number (mm) is interpreted as minutes.

-M mem_limit          - sets the per-process memory limit in megabytes (MB). The job's memory limit is then num_cores * mem_limit.

-n num_cores          - assigns the number of job slots/cores.

-x                    - reserves each allocated node exclusively for the job. The SUs charged reflect use of all the cores in a node.

-o filename           - directs the job's standard output to filename. The special string, %J, attaches the jobid.
 
-P project_name      - charges the consumed service units (SUs) to project specified.
 
-u e-mail_addr        - sends email to the specified address (e.g., netid@tamu.edu, myname@gmail.com) with information about main job events.
 
</pre>
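As a rough illustration of how these options fit together, the sketch below prints a job-file header from a handful of parameters and reports the resulting job memory limit (num_cores times the per-process limit). The helper name bsub_header and the sample values are hypothetical; only options listed above are used.

```shell
#!/bin/bash
# Print a job-file header from a few parameters (hypothetical helper).
# usage: bsub_header name cores ptile mem_mb_per_process walltime
bsub_header() {
    local name=$1 cores=$2 ptile=$3 mem_mb=$4 walltime=$5
    cat <<EOF
#BSUB -J $name
#BSUB -L /bin/bash
#BSUB -W $walltime
#BSUB -n $cores
#BSUB -R "span[ptile=$ptile]"
#BSUB -M $mem_mb
#BSUB -o $name.%J
## job memory limit: $(( cores * mem_mb )) MB
EOF
}

# Sample values only: 20 cores on one node, 2000 MB per process, 1.5 hours.
bsub_header demo 20 20 2000 1:30
```

Redirecting the output into a file and appending the commands to run gives a job file that can be passed to bsub.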
 
 
 
====More Examples====
 
In the following four job scripts, we illustrate in four different ways the execution of an application program, ABAQUS, to solve the same engineering problem specified
 
in the '''s4b.inp''' input file. The latter can be copied from the "Examples" database of ABAQUS by using the '''fetch''' option. Keep in mind, please, that not all problems
 
specified via ABAQUS are amenable to different types of effective parallelization. <br>
 
 
 
It is very important when running packaged code that the resource parameters (e.g., cpus, memory, gpu) you specify via BSUB directives are in agreement with their
 
counterparts on the application's command line. It turns out that the engineering problem described in s4b.inp shows remarkable improvement in performance as
 
we try different modes of execution: serial, GPU, OpenMP, and finally MPI.<br>
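One way to check that agreement before submitting is to compare the application's memory request with the job memory limit implied by -n and -M. The following is a minimal sketch: the function name mem_fits is illustrative, and treating 1 GB as 1024 MB is an assumption of the sketch, not something the application documentation is quoted on.

```shell
#!/bin/bash
# Succeed if the application's memory request (in GB) fits within the
# LSF job memory limit, i.e. cores (-n) times per-process limit (-M, in MB).
# Assumption: 1 GB = 1024 MB.
mem_fits() {
    local cores=$1 mem_mb_per_process=$2 app_mem_gb=$3
    local limit_mb=$(( cores * mem_mb_per_process ))
    local need_mb=$(( app_mem_gb * 1024 ))
    [ "$need_mb" -le "$limit_mb" ]
}

# Serial example: -n 1 -M 42000 with memory="32 gb"
mem_fits 1 42000 32 && echo "serial: ok"
# MPI example: -n 64 -M 2500 with memory="150 gb"
mem_fits 64 2500 150 && echo "mpi: ok"
# A mismatch: 1 core at 2500 MB cannot hold a 32 GB request
mem_fits 1 2500 32 || echo "mismatch: raise -M or lower memory="
```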
 
 
 
'''Example 2 (Serial)'''
 
 
 
<pre>
 
#BSUB -J s4b_serial -o s4b_serial.%J -W 400 -L /bin/bash -n 1 -R 'span[ptile=1]' -M 42000 -R 'select[nxt]'
 
##                            1 * 42,000MB = 42 GB mem_limit
 
#
 
mkdir $SCRATCH/abaqus; cd $SCRATCH/abaqus
 
#
 
module load ictce
 
module load ABAQUS
 
#
 
abaqus fetch job=s4b.inp
 
##
 
abaqus analysis job=s4b_serial_nxt input=s4b.inp memory="32 gb" double scratch=$SCRATCH/abaqus
 
</pre>
 
 
 
'''Example 3 (OpenMP)'''
 
 
 
<pre>
 
#BSUB -J s4b_smp -o s4b_smp.%J -L /bin/bash -W 40 -n 20 -R 'span[ptile=20]' -M 2000 -R 'select[nxt]'

##                                                                          20 * 2000MB = 40,000MB = 40 GB
 
## OpenMP/Multi-threaded run on 20 cores
 
#
 
mkdir $SCRATCH/abaqus
 
cd $SCRATCH/abaqus
 
#
 
module load ictce
 
module load ABAQUS
 
#
 
abaqus fetch job=s4b.inp
 
##
 
abaqus analysis job=s4b_smp input=s4b.inp mp_mode=threads cpus=20 memory="32 gb" double scratch=$SCRATCH/abaqus
 
#
 
</pre>
 
 
 
'''Example 4 (MPI)'''



<pre>

#BSUB -J s4b_mpi64 -o s4b_mpi64.%J -L /bin/bash -W 200 -n 64 -R 'span[ptile=16]' -M 2500 -x

##

## runs a 64-way MPI job, 16 cores per node, across 4 nodes. Total memory limit: 64 * 2500 MB = 160,000 MB = 160 GB

#

mkdir $SCRATCH/abaqus

cd $SCRATCH/abaqus

#

module load ictce

module load ABAQUS

#

abaqus fetch job=s4b.inp

#

abaqus analysis job=s4b_mpi64 input=./s4b.inp  mp_mode=mpi cpus=64 memory="150 gb" double scratch=$SCRATCH/abaqus

#

</pre>



'''Example 5 (GPU)'''
 
 
 
<pre>
 
#BSUB -J s4b_gpu -o s4b_gpu.%J -L /bin/bash -W 40 -n 1 -R 'span[ptile=1]' -M 40000 -R 'select[gpu256gb]'

##                                                                            1 * 40,000MB = 40 GB
 
mkdir $SCRATCH/abaqus
 
cd  $SCRATCH/abaqus
 
#
 
module load ictce
 
module load ABAQUS
 
#
 
abaqus fetch job=s4b.inp
 
##
 
abaqus analysis job=s4b_gpu input=s4b.inp gpus=1 memory="32 gb" double scratch=$SCRATCH/abaqus
 
#
 
</pre>
 
 
 
====Environment Variables====
 
 
 
When LSF selects and activates a node to run your job, by default it duplicates the environment the job was submitted from. You may

have altered that environment in the course of your work (e.g., by loading modules or by setting or changing environment variables),

so that it differs from the one the login process created. The next job you submit, however, may require a different execution environment. We therefore

recommend that each job request a new login shell and explicitly customize the environment it needs within the job file.

A new login shell per job is initialized by specifying the '''#BSUB -L /bin/bash''' option.<br>
 
 
 
All the nodes enlisted for the execution of a job carry most of the environment variables the login process created: HOME, PWD, PATH, USER, etc.
 
In addition, LSF defines new ones in the environment of an executing job. Below, we show an abbreviated list.
 
 
 
<pre>
 
LSB_QUEUE:    The name of the queue the job is dispatched from.
 
LSB_JOBNAME:  Name of the job.
 
LSB_JOBID:    Batch job ID assigned by LSF.
 
LSB_ERRORFILE: Name of the error file specified with a bsub -e.
 
LSB_HOSTS:    The list of nodes (their LSF symbolic names) that are used to run the batch job. A node name is repeated
 
              as many times as needed to equal the specified ptile value. The memory size of LSB_HOSTS variable is limited to 4096 bytes.
 
LSB_MCPU_HOSTS: The list of nodes (their LSF symbolic names) and the specified or default ptile value per node to run the batch job. This
 
                can be relied upon to contain the names of all the deployed hosts.
 
</pre>
 
 
 
'''Example.'''
 
The following is a Linux script to be used within a job to periodically track the load level on each of the allocated nodes.<br>
 
 
 
<pre>
 
#!/bin/bash
 
echo 'HOST_NAME      status  r15s  r1m  r15m  ut    pg  ls    it  tmp  swp  mem'
 
# LSB_MCPU_HOSTS alternates host names and slot counts; keep only the host names.

echo $LSB_MCPU_HOSTS | tr ' ' '\n' | awk 'NR % 2 == 1' | \

while read node_id

do

  lsload $node_id | sed '/^HOST/d'
 
done
 
</pre>
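LSB_MCPU_HOSTS can also be used to summarize an allocation. The sketch below counts the nodes and sums the slots in a string of alternating host names and slot counts. The function name and the sample host list are illustrative (only nxt1449 appears elsewhere on this page); inside a job you would pass "$LSB_MCPU_HOSTS" instead.

```shell
#!/bin/bash
# Print "<nodes> <slots>" for an LSB_MCPU_HOSTS-style string, which
# alternates host names and slot counts: "hostA 20 hostB 20".
summarize_mcpu_hosts() {
    # Left unquoted on purpose so the string splits into words.
    set -- $1
    local nodes=0 slots=0
    while [ $# -ge 2 ]; do
        nodes=$(( nodes + 1 ))
        slots=$(( slots + $2 ))
        shift 2
    done
    echo "$nodes $slots"
}

# Sample value; inside a job use: summarize_mcpu_hosts "$LSB_MCPU_HOSTS"
summarize_mcpu_hosts "nxt1449 20 nxt1450 20 nxt1451 16"   # 3 56
```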
 
 
 
==Job tracking and control commands==
 
 
 
<pre>
 
bjobs [-u all or user_name] [[-l] job_id]    # displays job information per user(s) or job_id, in summary or detail (-l) form.
 
bpeek [-f] job_id                            # displays the stdout and stderr output of an unfinished job.
 
bkill job_id                                # kills, suspends, or resumes unfinished jobs. See man bkill for details.
 
bmod [bsub_options]  job_id                # Modifies job submission options of a job. See man bmod for details.
 
lsload [node_name]                          # Lists on std out a node's utilization. Use bjobs -l jobid
 
                                            # to get the names of nodes associated with a jobid. See man lsload for details.
 
</pre>
 
 
 
'''Examples'''
 
<pre>
 
[userx@login4]$ bjobs -u all
 
JOBID      STAT  USER            QUEUE      JOB_NAME            NEXEC_HOST SLOTS RUN_TIME        TIME_LEFT
 
223537    RUN  adinar          long      NOR_Q                1          20    400404 second(s) 8:46 L
 
223547    RUN  adinar          long      NOR_Q                1          20    399830 second(s) 8:56 L
 
223182    RUN  tengxj1025      long      pro_at16_lowc        10        280  325922 second(s) 5:27 L
 
229307    RUN  natalieg        long      LES_MORE            3          900  225972 second(s) 25:13 L
 
229309    RUN  tengxj1025      long      pro_atat_lowc        7          280  223276 second(s) 33:58 L
 
229310    RUN  tengxj1025      long      cg16_lowc            5          280  223228 second(s) 33:59 L
 
. . .            . . .    . . .
 
 
 
[userx@login4]$ bjobs -l 229309
 
 
 
Job <229309>, Job Name <pro_atat_lowc>, User <tengxj1025>, Project <default>, M
 
                          ail <czjnbb@gmail.com>, Status <RUN>, Queue <long>, J
 
                          ob Priority <250000>, Command <## job name;#BSUB -J p
 
                          ro_atat_lowc; ## send stderr and stdout to the same f
 
                          ile ;#BSUB -o info.%J; ## login shell to avoid copyin
 
                          g env from login session;## also helps the module fun
 
                          ction work in batch jobs;#BSUB -L /bin/bash; ## 30 mi
 
                          nutes of walltime ([HH:]MM);#BSUB -W 96:00; ## numpro
 
                          cs;#BSUB -n 280; . . .
 
                          . . .
 
 
 
RUNLIMIT
 
5760.0 min of nxt1449
 
Tue Nov  4 21:34:43 2014: Started on 280 Hosts/Processors <nxt1449> <nxt1449> <
 
                          nxt1449> <nxt1449> <nxt1449> <nxt1449>  ...
 
                          . . .
 
 
 
Execution
 
                          CWD </scratch/user/tengxj1025/EXTD/pro_atat/lowc/md>;
 
Fri Nov  7 12:05:55 2014: Resource usage collected.
 
                          The CPU time used is 67536997 seconds.
 
                          MEM: 44.4 Gbytes;  SWAP: 0 Mbytes;  NTHREAD: 862
 
 
 
                          HOST: nxt1449
 
                          MEM: 3.2 Gbytes;  SWAP: 0 Mbytes; CPU_TIME: 9004415 s
 
                          econds . . .
 
                          . . .
 
                          . . .
 
 
 
 
 
[userx@login4]$ bmod -W 46:00 229309            # resets wall-clock time to 46 hrs for job 229309
 
 
 
 
 
</pre>
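Note that bjobs -l reports the run limit in minutes (5760.0 min above, for a job submitted with -W 96:00), while -W is usually written as hh:mm. A small conversion helper, sketched here under the assumption that the specification is either hh:mm or a bare number of minutes, makes the two easy to compare. The function name is illustrative.

```shell
#!/bin/bash
# Convert a -W style wall-clock specification (hh:mm or mm) to minutes.
walltime_minutes() {
    local spec=$1 hours=0 minutes
    case $spec in
        *:*) hours=${spec%%:*}; minutes=${spec##*:} ;;
        *)   minutes=$spec ;;
    esac
    # 10# forces base 10 so values such as "08" are not read as octal.
    echo $(( 10#$hours * 60 + 10#$minutes ))
}

walltime_minutes 96:00   # 5760
walltime_minutes 46:00   # 2760
walltime_minutes 400     # 400
```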
 
 
 
 
 
'''The lsload command & Node utilization.''' It may happen that a job uses its allocated nodes inefficiently.
 
Sometimes this is unavoidable, but many times it is very avoidable. It is unavoidable, for instance, if the
 
amount of memory used per node is a large fraction of the total for that node, and only 1 cpu is used. In
 
that case, cpu utilization will be at best at 5% (1/20) in a regular node. The main tool for tracking node
 
utilization is the lsload command.
 
 
 
<pre>
 
lsload [node_name]               # Lists on std out a node's utilization. (Focus on ut and pg columns). Use bjobs -l jobid
 
                                  # to get the names of nodes associated with a jobid.
 
</pre>
 
 
 
Below we list the output from the homemade shell script, node_use, based on lsload, which uses as input the jobid. The 6 nodes attached to job 260291 exhibit
 
fairly uneven usage ('''ut''' column): the first three nodes versus the bottom three. Any non-zero values for the '''pg''' column would also have been a point of
 
concern. It would signify potential memory paging issues with consequent slowdowns.
 
 
 
<pre>
 
./node_use 260291
 
HOST_NAME        status  r15s  r1m  r15m  ut    pg  ls  it    tmp  swp  mem
 
nxt1739              ok  20.4  21.1  20.9 100%  0.0  0  6224 491M  4.7G 43.2G
 
nxt2130              ok  20.3  20.1  19.8 100%  0.0  0  2920 495M  4.5G  43G
 
nxt2131              ok  20.2  20.6  20.4 100%  0.0  0  2920 495M  4.7G 42.8G
 
nxt2137              ok  8.0  8.4  8.4  40%  0.0  0  2920 495M  4.7G 52.7G
 
nxt1220              ok  8.0  8.0  8.0  40%  0.0  0  1959 497M  4.7G 52.7G
 
nxt1221              ok  8.0  8.0  8.1  40%  0.0  0  1959 497M  4.7G 52.7G
 
</pre>
 
 
 
The above imbalance may be there by design or poor programming. If not by design, you should
 
investigate further, enlisting the intervention, if need be, of our Helpdesk staff. The effective
 
use of nodes by jobs causes them to finish sooner. This improves the whole system's efficiency.
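Spotting such an imbalance by eye gets tedious for larger jobs. As a sketch, the filter below reads lsload-style lines and prints the hosts whose '''ut''' value falls below a threshold or whose '''pg''' value is non-zero. It assumes the short lsload column layout shown above (ut in column 6, pg in column 7); the function name and the default threshold of 50 are arbitrary choices for illustration.

```shell
#!/bin/bash
# Read lsload-style lines on stdin; print hosts whose ut (column 6) is
# below a threshold percent or whose pg (column 7) is non-zero.
flag_nodes() {
    local min_ut=${1:-50}
    awk -v min="$min_ut" '
        $1 == "HOST_NAME" { next }       # skip the header line
        NF < 7 { next }                  # skip blank or short lines
        {
            ut = $6; sub(/%/, "", ut)    # "40%" -> "40"
            if (ut + 0 < min || $7 + 0 > 0)
                print $1, $6, "pg=" $7
        }'
}

# Two lines from the listing above serve as sample input.
flag_nodes 50 <<'EOF'
HOST_NAME        status  r15s  r1m  r15m  ut    pg  ls  it    tmp  swp  mem
nxt1739              ok  20.4  21.1  20.9 100%  0.0  0  6224 491M  4.7G 43.2G
nxt2137              ok  8.0  8.4  8.4  40%  0.0  0  2920 495M  4.7G 52.7G
EOF
```

In practice the input would come from a pipeline such as ./node_use 260291 | flag_nodes 50.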
 
 
 
Below we list the homemade script, '''node_use''', for tracking a job's node usage:
 
 
 
<pre>
 
#!/bin/bash
 
# usage: node_use jobid; Nov 2014. For use on Ada only.
 
echo 'HOST_NAME      status  r15s  r1m  r15m  ut    pg  ls    it  tmp  swp  mem'
 
#
 
bjobs -l $1 | egrep -i "HOST: " | sed '1,$s/HOST: //g' | \
 
while read node_ID
 
do
 
    lsload $node_ID | sed '/^HOST/d'
 
done
 
 
 
</pre>
 

Latest revision as of 14:31, 18 June 2020

Job Submission

Once you have your job file ready, it is time to submit your job. You can submit your job to LSF with the following command:

[ NetID@ada ~]$ bsub < MyJob.LSF
Verifying job submission parameters...
Verifying project account...
     Account to charge:   123456789123
         Balance (SUs):      5000.0000
         SUs to charge:         5.0000
Job <12345> is submitted to default queue <sn_regular>.
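Scripts that submit jobs often need the job id for later tracking. A minimal sketch, assuming the submission message has the "Job <N> is submitted ..." form shown above, extracts it with sed. The function name parse_jobid is illustrative.

```shell
#!/bin/bash
# Extract the numeric job id from bsub's "Job <N> is submitted ..." line.
parse_jobid() {
    sed -n 's/^Job <\([0-9][0-9]*\)> is submitted.*/\1/p'
}

# Sample output is piped in here. On the cluster one would write:
#   jobid=$(bsub < MyJob.LSF | parse_jobid)
printf '%s\n' \
    'Verifying job submission parameters...' \
    'Job <12345> is submitted to default queue <sn_regular>.' | parse_jobid
```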

tamubatch

tamubatch is an automatic batch job script that submits jobs on the Ada and Terra clusters without requiring the user to write a batch script. The user only needs to provide the executable commands in a text file, and tamubatch will automatically submit the job to the cluster. Optional flags give the user control over the parameters of the submitted job.

tamubatch is still in beta and has not been fully developed. Although there are still bugs and testing issues that are currently being worked on, tamubatch can already submit jobs to both the Ada and Terra clusters if given a file of executable commands.

For more information, visit the tamubatch page: https://hprc.tamu.edu/wiki/SW:tamubatch

tamulauncher

tamulauncher provides a convenient way to run a large number of serial or multithreaded commands without the need to submit individual jobs or a job array. The user provides a text file containing all commands that need to be executed, and tamulauncher executes the commands concurrently. The number of concurrently executed commands depends on the batch requirements. When tamulauncher is run interactively, the number of concurrently executed commands is limited to at most 8. tamulauncher is available on Terra, Ada, and Curie. There is no need to load any module before using tamulauncher. tamulauncher has been successfully tested to execute over 100K commands.

tamulauncher is preferred over Job Arrays to submit a large number of individual jobs, especially when the run times of the commands are relatively short. It allows for better utilization of the nodes, puts less burden on the batch scheduler, and lessens interference with jobs of other users on the same node.

For more information, visit the tamulauncher page: https://hprc.tamu.edu/wiki/SW:tamulauncher#tamulauncher
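The commands file itself is easy to generate with a loop. The sketch below prints one command per line; the program name ./my_prog and the input and output file names are hypothetical placeholders. How the resulting file is passed to tamulauncher is described on the tamulauncher page and is not shown here.

```shell
#!/bin/bash
# Print one command per line, the form of text file described above.
# ./my_prog and the file names are hypothetical placeholders.
make_commands() {
    local count=$1 i
    for (( i = 1; i <= count; i++ )); do
        echo "./my_prog input_$i.dat > output_$i.log"
    done
}

# Three commands for illustration; redirect to a file for real use,
# e.g. make_commands 1000 > commands.txt
make_commands 3
```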