Difference between revisions of "Ada:Batch Job Submission"
Revision as of 19:12, 13 January 2015
Job Submission: the bsub command
bsub < jobfile # Submits specified job for processing by LSF
Here is an illustration,
Three important job parameters:
    #BSUB -n NNN                   # NNN: total number of cpus to allocate for the job
    #BSUB -R "span[ptile=XX]"      # XX: number of cores/cpus per node to use
    #BSUB -R "select[node-type]"   # node-type: nxt, mem256gb, gpu, phi, mem1t, mem2t ...
It is worth emphasizing that, under the current LSF setup, only the -x option and a ptile value equal to the node's core limit will prevent LSF from placing other jobs on a node's remaining unreserved cores.
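The relation between the -n and ptile values can be sketched with a little shell arithmetic. The helper name nodes_needed is hypothetical (it is not an LSF command), and the sketch assumes LSF fills each node with up to ptile cores before moving to the next one.

```shell
# nodes_needed TOTAL_CORES PTILE   (hypothetical helper, not part of LSF)
# Prints how many nodes a job spans when it asks for TOTAL_CORES slots (-n)
# at PTILE cores per node (span[ptile=PTILE]). The division rounds up,
# because a partly filled last node still counts as a node.
nodes_needed() {
    total=$1
    ptile=$2
    echo $(( (total + ptile - 1) / ptile ))
}

nodes_needed 64 16   # prints 4: a 64-core job at 16 cores per node
nodes_needed 20 20   # prints 1: a 20-core job that fills one 20-core node
```

A ptile value equal to the node's core count, as in the second call, is the case the paragraph above singles out: the job leaves no unreserved cores on its node.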
Common BSUB Options
    -J job_name       # sets the job name.
    -L /bin/bash      # uses the bash login shell to initialize the job's execution environment.
    -W hh:mm          # sets the job's runtime wall-clock limit. A bare xx specification indicates minutes.
    -M ####           # sets the per-process memory limit in MB. The per-job memory limit is then number of cores * ####.
    -n ####           # assigns the number of job slots/cores.
    -x                # assigns a whole node (same node as above) exclusively to the job.
    -o filename       # directs the job's standard output to filename. The special string %J inserts the job ID.
    -P project_name   # charges the consumed service units (SUs) to the specified project.
    -u e-mail_addr    # sends email to the specified address (e.g., email@example.com, firstname.lastname@example.org)
                      # with information about main job events.
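As a rough illustration of the -W and -M arithmetic described above, the two helpers below convert a wall-clock limit to minutes and compute the per-job memory limit. The function names wall_minutes and job_mem_mb are made up for this sketch; they only do the arithmetic and do not query LSF.

```shell
# wall_minutes LIMIT   (hypothetical helper)
# Converts a -W value, given as "hh:mm" or as plain minutes, to minutes.
wall_minutes() {
    echo "$1" | awk -F: '{ if (NF == 2) print $1 * 60 + $2; else print $1 + 0 }'
}

# job_mem_mb CORES PER_PROCESS_MB   (hypothetical helper)
# Per-job memory limit in MB: number of cores times the -M value.
job_mem_mb() {
    echo $(( $1 * $2 ))
}

wall_minutes 96:00    # prints 5760
wall_minutes 400      # prints 400
job_mem_mb 64 2500    # prints 160000, i.e. 160 GB for -n 64 -M 2500
```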
In the following four job scripts, we illustrate in four different ways the execution of an application program, ABAQUS, to solve the same engineering problem specified
in the s4b.inp input file. The latter can be copied from the "Examples" database of ABAQUS by using the fetch option. Keep in mind, please, that not all problems
specified via ABAQUS are amenable to different types of effective parallelization.
It is very important when running packaged code that the resource parameters (e.g., cpus, memory, gpu) you specify via BSUB directives agree with their
counterparts on the application's command line. The engineering problem described in s4b.inp shows remarkable improvement in performance as
we move through the different modes of execution: serial, GPU, OpenMP, and finally MPI.
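One way to catch a mismatch between the BSUB directives and the application's command line before submitting is a small pre-submission check. The sketch below is only an illustration: check_cpus is a hypothetical helper, it looks only at the -n directive and the cpus= argument, and it assumes that a missing value on either side means 1.

```shell
# check_cpus JOBFILE   (hypothetical helper, not part of LSF or ABAQUS)
# Compares the -n value found on the #BSUB lines with the cpus= value on the
# first non-comment line that carries one. Assumption: a missing -n or cpus=
# is treated as 1. Returns non-zero on a mismatch.
check_cpus() {
    slots=$(grep '^#BSUB' "$1" | sed -n 's/.*-n  *\([0-9][0-9]*\).*/\1/p' | head -n 1)
    cpus=$(grep -v '^#' "$1" | sed -n 's/.*cpus=\([0-9][0-9]*\).*/\1/p' | head -n 1)
    slots=${slots:-1}
    cpus=${cpus:-1}
    if [ "$slots" = "$cpus" ]; then
        echo "ok: -n $slots matches cpus=$cpus"
    else
        echo "mismatch: -n $slots but cpus=$cpus"
        return 1
    fi
}

# Typical use, with a placeholder job file name:
#   check_cpus my_job_file && bsub < my_job_file
```

The same pattern could be extended to memory or gpu settings, but those comparisons depend on each application's own option syntax.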
Example 2 (Serial)
    #BSUB -J s4b_serial -o s4b_serial.%J -W 400 -L /bin/bash -n 1 -R 'span[ptile=1]' -M 42000 -R 'select[nxt]'
    ## 1 * 42,000MB = 42 GB mem_limit
    #
    mkdir $SCRATCH/abaqus; cd $SCRATCH/abaqus
    #
    module load ictce
    module load ABAQUS
    #
    abaqus fetch job=s4b.inp
    ##
    abaqus analysis job=s4b_serial_nxt input=s4b.inp memory="32 gb" double scratch=$SCRATCH/abaqus
Example 3 (OpenMP)
    #BSUB -J s4b_smp -o s4b_smp.%J -L /bin/bash -W 40 -n 20 -R 'span[ptile=20]' -M 2000 -R 'select[nxt]'
    ## 20 * 2,000MB = 40,000MB = 40 GB
    ## OpenMP/Multi-threaded run on 20 cores
    #
    mkdir $SCRATCH/abaqus
    cd $SCRATCH/abaqus
    #
    module load ictce
    module load ABAQUS
    #
    abaqus fetch job=s4b.inp
    ##
    abaqus analysis job=s4b_smp input=s4b.inp mp_mode=threads cpus=20 memory="32 gb" double scratch=$SCRATCH/abaqus
    #
Example 4 (MPI)
    #BSUB -J s4b_mpi64 -o s4b_mpi64.%J -L /bin/bash -W 200 -n 64 -R 'span[ptile=16]' -M 2500 -x
    ##
    ## runs a 64-way mpi job, 16 cores per node, across 4 nodes. Total memory limit, 64 * 2500 MB = 160,000MB = 160 GB
    #
    mkdir $SCRATCH/abaqus
    cd $SCRATCH/abaqus
    #
    module load ictce
    module load ABAQUS
    #
    abaqus fetch job=s4b.inp
    #
    abaqus analysis job=s4b_mpi64 input=./s4b.inp mp_mode=mpi cpus=64 memory="150 gb" double scratch=$SCRATCH/abaqus
    #
Example 5 (GPU)
    #BSUB -J s4b_gpu -o s4b_gpu.%J -L /bin/bash -W 40 -n 1 -R 'span[ptile=1]' -M 40000 -R 'select[gpu256gb]'
    ## 1 * 40,000MB = 40 GB
    mkdir $SCRATCH/abaqus
    cd $SCRATCH/abaqus
    #
    module load ictce
    module load ABAQUS
    #
    abaqus fetch job=s4b.inp
    ##
    abaqus analysis job=s4b_gpu input=s4b.inp gpus=1 memory="32 gb" double scratch=$SCRATCH/abaqus
    #
When LSF selects and activates a node to run your job, by default it duplicates the environment the job was submitted from. In the course of your
work you may have altered that environment (e.g., by loading modules or by setting or changing environment variables),
so that it differs from the one the login process created. The next job you submit, however, may require a different execution environment. We therefore
recommend that each job request a new login shell and explicitly customize its environment within the job script as needed.
A new login shell per job is initialized by specifying the #BSUB -L /bin/bash option.
All the nodes enlisted for the execution of a job carry most of the environment variables the login process created: HOME, PWD, PATH, USER, etc. In addition, LSF defines new ones in the environment of an executing job. Below, we show an abbreviated list.
LSB_QUEUE: The name of the queue the job is dispatched from.
LSB_JOBNAME: Name of the job.
LSB_JOBID: Batch job ID assigned by LSF.
LSB_ERRORFILE: Name of the error file specified with bsub -e.
LSB_HOSTS: The list of nodes (their LSF symbolic names) that are used to run the batch job. A node name is repeated as many times as needed to equal the specified ptile value. The size of the LSB_HOSTS variable is limited to 4096 bytes.
LSB_MCPU_HOSTS: The list of nodes (their LSF symbolic names) and the specified or default ptile value per node to run the batch job. This variable can be relied upon to contain the names of all the deployed hosts.
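As an illustration of how LSB_MCPU_HOSTS might be consumed inside a job, the sketch below splits its value into one "host slots" pair per line and adds up the slots. The function names are hypothetical, and the fallback value (including the host name nxt1450) is made up so that the snippet also runs outside of a job.

```shell
# mcpu_pairs   (hypothetical helper)
# Prints one "host slots" pair per line from LSB_MCPU_HOSTS, whose value
# alternates node names and per-node slot counts.
mcpu_pairs() {
    echo $LSB_MCPU_HOSTS | awk '{ for (i = 1; i < NF; i += 2) print $i, $(i + 1) }'
}

# total_slots   (hypothetical helper)
# Adds up the per-node slot counts.
total_slots() {
    mcpu_pairs | awk '{ n += $2 } END { print n + 0 }'
}

# Outside a job the variable is unset; this made-up value is for illustration only.
LSB_MCPU_HOSTS=${LSB_MCPU_HOSTS:-"nxt1449 20 nxt1450 20"}
mcpu_pairs
total_slots
```

With the fallback value, the first call prints the two pairs and the second prints 40. The per-host lines can also serve as the starting point for a host file, if the application expects one.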
The following is a Linux script to be used within a job to periodically track the load level on each of the allocated nodes.
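    #!/bin/bash
    # Prints the current load of every node allocated to this job.
    echo 'HOST_NAME status r15s r1m r15m ut pg ls it tmp swp mem'
    # LSB_MCPU_HOSTS alternates node names and slot counts; keep only the names.
    echo $LSB_MCPU_HOSTS | awk '{ for (i = 1; i <= NF; i += 2) print $i }' | \
    while read node_id
    do
       lsload -l $node_id | sed '/^HOST/d'
    done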
Job tracking and control commands
    bjobs [-u all or user_name] [[-l] job_id]   # displays job information per user(s) or job_id, in summary or detail (-l) form.
    bpeek [-f] job_id                           # displays the stdout and stderr output of an unfinished job.
    bkill job_id                                # kills, suspends, or resumes unfinished jobs. See man bkill for details.
    bmod [bsub_options] job_id                  # Modifies job submission options of a job. See man bmod for details.
    lsload [node_name]                          # Lists on std out a node's utilization. Use bjobs -l jobid
                                                # to get the names of nodes associated with a jobid. See man lsload for details.
    [userx@login4]$ bjobs -u all
    JOBID   STAT  USER        QUEUE  JOB_NAME       NEXEC_HOST  SLOTS  RUN_TIME          TIME_LEFT
    223537  RUN   adinar      long   NOR_Q          1           20     400404 second(s)  8:46 L
    223547  RUN   adinar      long   NOR_Q          1           20     399830 second(s)  8:56 L
    223182  RUN   tengxj1025  long   pro_at16_lowc  10          280    325922 second(s)  5:27 L
    229307  RUN   natalieg    long   LES_MORE       3           900    225972 second(s)  25:13 L
    229309  RUN   tengxj1025  long   pro_atat_lowc  7           280    223276 second(s)  33:58 L
    229310  RUN   tengxj1025  long   cg16_lowc      5           280    223228 second(s)  33:59 L
    . . .   . . .   . . .

    [userx@login4]$ bjobs -l 229309
    Job <229309>, Job Name <pro_atat_lowc>, User <tengxj1025>, Project <default>, M
                         ail <email@example.com>, Status <RUN>, Queue <long>, J
                         ob Priority <250000>, Command <## job name;#BSUB -J p
                         ro_atat_lowc; ## send stderr and stdout to the same f
                         ile ;#BSUB -o info.%J; ## login shell to avoid copyin
                         g env from login session;## also helps the module fun
                         ction work in batch jobs;#BSUB -L /bin/bash; ## 30 mi
                         nutes of walltime ([HH:]MM);#BSUB -W 96:00; ## numpro
                         cs;#BSUB -n 280;
    . . .   . . .
     RUNLIMIT
     5760.0 min of nxt1449
    Tue Nov 4 21:34:43 2014: Started on 280 Hosts/Processors <nxt1449> <nxt1449> <
                         nxt1449> <nxt1449> <nxt1449> <nxt1449> ...
    . . .
                         Execution CWD </scratch/user/tengxj1025/EXTD/pro_atat/lowc/md>;
    Fri Nov 7 12:05:55 2014: Resource usage collected.
                         The CPU time used is 67536997 seconds.
                         MEM: 44.4 Gbytes; SWAP: 0 Mbytes; NTHREAD: 862
                         HOST: nxt1449
                         MEM: 3.2 Gbytes; SWAP: 0 Mbytes; CPU_TIME: 9004415 s
                         econds
    . . .   . . .   . . .

    [userx@login4]$ bmod -W 46:00 229309    # resets wall-clock time to 46 hrs for job 229309
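The summary listing above lends itself to simple post-processing. The sketch below totals the SLOTS column per user; slots_by_user is a hypothetical helper, and it assumes the column layout shown above (USER in field 3, SLOTS in field 7, job names without spaces). The sample rows are the ones from the listing.

```shell
# slots_by_user   (hypothetical helper)
# Reads "bjobs -u all" style lines on stdin and prints the total number of
# slots per user, sorted by user name. Lines that do not start with a numeric
# job ID (the header, filler lines) are skipped.
slots_by_user() {
    awk '$1 ~ /^[0-9]+$/ { slots[$3] += $7 }
         END { for (u in slots) print u, slots[u] }' | sort
}

slots_by_user <<'EOF'
JOBID STAT USER QUEUE JOB_NAME NEXEC_HOST SLOTS RUN_TIME TIME_LEFT
223537 RUN adinar long NOR_Q 1 20 400404 second(s) 8:46 L
223547 RUN adinar long NOR_Q 1 20 399830 second(s) 8:56 L
223182 RUN tengxj1025 long pro_at16_lowc 10 280 325922 second(s) 5:27 L
229307 RUN natalieg long LES_MORE 3 900 225972 second(s) 25:13 L
229309 RUN tengxj1025 long pro_atat_lowc 7 280 223276 second(s) 33:58 L
229310 RUN tengxj1025 long cg16_lowc 5 280 223228 second(s) 33:59 L
EOF
# prints:
#   adinar 40
#   natalieg 900
#   tengxj1025 840
```

On the cluster itself the input would come from the live command instead, for example bjobs -u all | slots_by_user.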
The lsload command & Node utilization
It may happen that a job uses its allocated nodes inefficiently. Sometimes this is unavoidable, but often it can be avoided. It is unavoidable, for instance, if the job uses a large fraction of a node's total memory while running on only 1 cpu. In that case, cpu utilization will be at best 5% (1/20) on a regular node. The main tool for tracking node utilization is the lsload command.
    lsload [node_name]   # Lists on std out a node's utilization. (Focus on ut and pg columns). Use bjobs -l jobid
                         # to get the names of nodes associated with a jobid.
Below we list the output from the homemade shell script, node_use, based on lsload, which uses as input the jobid. The 6 nodes attached to job 260291 exhibit fairly uneven usage (ut column): the first three nodes versus the bottom three. Any non-zero values for the pg column would also have been a point of concern. It would signify potential memory paging issues with consequent slowdowns.
    ./node_use 260291
    HOST_NAME  status  r15s  r1m   r15m  ut    pg   ls  it    tmp   swp   mem
    nxt1739    ok      20.4  21.1  20.9  100%  0.0  0   6224  491M  4.7G  43.2G
    nxt2130    ok      20.3  20.1  19.8  100%  0.0  0   2920  495M  4.5G  43G
    nxt2131    ok      20.2  20.6  20.4  100%  0.0  0   2920  495M  4.7G  42.8G
    nxt2137    ok      8.0   8.4   8.4   40%   0.0  0   2920  495M  4.7G  52.7G
    nxt1220    ok      8.0   8.0   8.0   40%   0.0  0   1959  497M  4.7G  52.7G
    nxt1221    ok      8.0   8.0   8.1   40%   0.0  0   1959  497M  4.7G  52.7G
The above imbalance may be there by design or may result from poor programming. If it is not by design, you should investigate further, enlisting the help of our Helpdesk staff if need be. Jobs that use their nodes effectively finish sooner, which improves the efficiency of the whole system.
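Spotting under-used nodes in such output can be automated with a short filter. The sketch below is an assumption-laden illustration: low_util is a hypothetical helper, it expects the column layout of the listing above (ut in field 6, written as a percentage), and the 50% threshold is an arbitrary choice. The sample rows are taken from the listing for job 260291.

```shell
# low_util THRESHOLD   (hypothetical helper)
# Reads node_use/lsload style lines on stdin and prints the name and ut value
# of every node whose utilization is below THRESHOLD percent.
low_util() {
    awk -v limit="$1" '$1 != "HOST_NAME" {
        ut = $6
        sub(/%/, "", ut)                 # "40%" -> "40"
        if (ut + 0 < limit + 0) print $1, $6
    }'
}

low_util 50 <<'EOF'
HOST_NAME status r15s r1m r15m ut pg ls it tmp swp mem
nxt1739 ok 20.4 21.1 20.9 100% 0.0 0 6224 491M 4.7G 43.2G
nxt2130 ok 20.3 20.1 19.8 100% 0.0 0 2920 495M 4.5G 43G
nxt2131 ok 20.2 20.6 20.4 100% 0.0 0 2920 495M 4.7G 42.8G
nxt2137 ok 8.0 8.4 8.4 40% 0.0 0 2920 495M 4.7G 52.7G
nxt1220 ok 8.0 8.0 8.0 40% 0.0 0 1959 497M 4.7G 52.7G
nxt1221 ok 8.0 8.0 8.1 40% 0.0 0 1959 497M 4.7G 52.7G
EOF
# prints:
#   nxt2137 40%
#   nxt1220 40%
#   nxt1221 40%
```

A job that keeps a single core busy on a 20-core node would show up here at roughly 5%, the 1/20 case mentioned earlier.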
Below we list the homemade script, node_use, for tracking a job's node usage:
    #!/bin/bash
    # usage: node_use jobid; Nov 2014. For use on Ada only.
    echo 'HOST_NAME status r15s r1m r15m ut pg ls it tmp swp mem'
    #
    # Pull the node names out of the "HOST: " lines of the detailed job listing,
    # then print one lsload line per node without its header.
    bjobs -l "$1" | grep -i "HOST: " | sed '1,$s/HOST: //g' | \
    while read node_ID
    do
       lsload $node_ID | sed '/^HOST/d'
    done