Ada:Batch Job Submission
Job Submission: the bsub command
bsub < jobfile # Submits specified job for processing by LSF
Here is an illustration:
$ bsub < sample1.job
Verifying job submission parameters...
Job <224139> is submitted to default queue <devel>.
The first thing LSF does upon submission is to tag your job with a numeric identifier, a job id.
Above, that identifier is 224139. You will need it in order to track or manage (kill or modify)
your jobs. Next, note that the default current working directory for the job is the directory
you submitted the job from. If that's not what you need, you must explicitly indicate that, as we
do above when we cd into a specific directory. On job completion, LSF places the file
stdout1.224139 in the submission directory. It contains a log of job events and other data
directed to standard output. Always inspect this file for useful information.
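Because the job id is needed for every later tracking or control command, it can be convenient to capture it at submission time. The following is a minimal sketch, assuming the confirmation line has the form shown in the illustration above; extract_job_id is a hypothetical helper name, not an LSF command.

```shell
#!/bin/bash
# Hypothetical helper: read bsub's output on stdin and print only the
# numeric job id from a line such as
#   Job <224139> is submitted to default queue <devel>.
extract_job_id() {
    sed -n 's/^Job <\([0-9][0-9]*\)> is submitted.*/\1/p'
}

# Intended use on the cluster (left commented out so that the sketch
# can be read and tried without LSF):
#   jobid=$(bsub < sample1.job | extract_job_id)
#   bjobs -l "$jobid"
```

Lines that do not match the confirmation pattern, such as "Verifying job submission parameters...", are dropped, so the function prints nothing if the submission was rejected.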
By default, a job executes under the environment of the submitting process. You can change this by using the -L shell option (see below) and/or by specifying, at the start of the job script, the shell that will execute it. For example, if you want the job to execute under the C-shell, the first line of the job script, above the #BSUB directives, should be #!/bin/csh.
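As a sketch of how the two mechanisms fit together, here is a small job file that names its executing shell on the first line and requests a login shell with -L. The directives are modeled on those visible in the bjobs -l listing further down this page; the job name, output file name, and limits are hypothetical placeholders, and the variable run_dir exists only for illustration.

```shell
#!/bin/bash
## job name (hypothetical)
#BSUB -J sample1
## login shell, so the job does not copy the environment of the submitting session
#BSUB -L /bin/bash
## send stdout and stderr to the same file; %J expands to the job id
#BSUB -o stdout1.%J
## wall-clock limit in [HH:]MM form
#BSUB -W 0:30
## number of job slots
#BSUB -n 1

# Everything below runs in the shell named on the first line.
# By default this is the directory the job was submitted from.
run_dir=$(pwd)
echo "Job running in: $run_dir"
```

To run the same job under the C-shell, the first line would become #!/bin/csh and the commands after the directives would need to be written in csh syntax.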
Job tracking and control commands
bjobs [-u all or user_name] [[-l] job_id]   # displays job information per user(s) or job_id, in summary or detail (-l) form, respectively.
bpeek [-f] job_id                           # displays the current contents of stdout and stderr output of an executing job.
bkill job_id                                # kills, suspends, or resumes unfinished jobs. See man bkill for details.
bmod [bsub_options] job_id                  # Modifies job submission options of a job. See man bmod for details.
lsload [node_name]                          # Lists on std out a node's utilization. Use bjobs -l jobid
                                            # to get the names of nodes associated with a jobid. See man lsload for details.
All of the above have decent man pages, if you're interested in more detail.
Examples
$ bjobs -u all
JOBID   STAT  USER        QUEUE  JOB_NAME       NEXEC_HOST  SLOTS  RUN_TIME          TIME_LEFT
223537  RUN   adinar      long   NOR_Q          1           20     400404 second(s)  8:46 L
223547  RUN   adinar      long   NOR_Q          1           20     399830 second(s)  8:56 L
223182  RUN   tengxj1025  long   pro_at16_lowc  10          280    325922 second(s)  5:27 L
229307  RUN   natalieg    long   LES_MORE       3           900    225972 second(s)  25:13 L
229309  RUN   tengxj1025  long   pro_atat_lowc  7           280    223276 second(s)  33:58 L
229310  RUN   tengxj1025  long   cg16_lowc      5           280    223228 second(s)  33:59 L
. . .   . . .   . . .

$ bjobs -l 229309
Job <229309>, Job Name <pro_atat_lowc>, User <tengxj1025>, Project <default>, M
                     ail <czjnbb@gmail.com>, Status <RUN>, Queue <long>, J
                     ob Priority <250000>, Command <## job name;#BSUB -J p
                     ro_atat_lowc; ## send stderr and stdout to the same f
                     ile ;#BSUB -o info.%J; ## login shell to avoid copyin
                     g env from login session;## also helps the module fun
                     ction work in batch jobs;#BSUB -L /bin/bash; ## 30 mi
                     nutes of walltime ([HH:]MM);#BSUB -W 96:00; ## numpro
                     cs;#BSUB -n 280;
. . .   . . .
 RUNLIMIT
 5760.0 min of nxt1449
Tue Nov 4 21:34:43 2014: Started on 280 Hosts/Processors <nxt1449> <nxt1449> <
                     nxt1449> <nxt1449> <nxt1449> <nxt1449> ...
. . .
                     Execution CWD </scratch/user/tengxj1025/EXTD/pro_atat/lowc/md>;
Fri Nov 7 12:05:55 2014: Resource usage collected.
                     The CPU time used is 67536997 seconds.
                     MEM: 44.4 Gbytes; SWAP: 0 Mbytes; NTHREAD: 862
                     HOST: nxt1449
                     MEM: 3.2 Gbytes; SWAP: 0 Mbytes; CPU_TIME: 9004415 s
                     econds
. . .   . . .   . . .

$ bmod -W 46:00 229309   # resets wall-clock time to 46 hrs for job 229309
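The RUNLIMIT of 5760.0 min in the bjobs -l listing is the job's #BSUB -W 96:00 request expressed in minutes. The arithmetic behind that conversion can be sketched as follows; walltime_to_minutes is a hypothetical helper name, and the sketch assumes the [HH:]MM form quoted in the job's own comments.

```shell
#!/bin/bash
# Hypothetical helper: convert a -W value in [HH:]MM form to minutes.
walltime_to_minutes() {
    printf '%s\n' "$1" | awk -F: '{
        if (NF == 2) print $1 * 60 + $2   # HH:MM
        else         print $1 + 0         # MM only
    }'
}

walltime_to_minutes 96:00   # prints 5760, matching the RUNLIMIT shown above
walltime_to_minutes 46:00   # prints 2760, the limit after the bmod example
```

Using awk for the arithmetic means that values with leading zeros, such as 08, are read as decimal numbers.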
Node Utilization. It may happen that a job uses its allocated nodes inefficiently.
Sometimes this is unavoidable, but often it can be avoided. It is unavoidable, for instance, if the
amount of memory used per node is a large fraction of the total for that node while only 1 CPU is used. In
that case, CPU utilization will be at best 5% (1/20) on a regular node. A handy tool for tracking node
utilization, more practical than lsload, is the homegrown lnu command.
lnu [-h] [-l] -j jobid # lists on stdout the utilization across all nodes for an executing job. See examples below.
Examples
$ lnu -l -j 795375
Job     User      Queue   Status  Node  Cpus
795375  jomber23  medium  R       4     80
HOST_NAME     status  r15s  r1m   r15m  ut   pg   ls  it     tmp   swp   mem    Assigned Cores
nxt1417       ok      20.0  21.0  21.0  97%  0.0  0   94976  366M  3.7G  41.6G  20
nxt1764 (L)   ok      19.7  20.0  20.0  95%  0.0  0   95040  366M  3.7G  41.5G  20
nxt2111       ok      20.0  20.0  20.0  98%  0.0  0   91712  370M  4.2G  41.5G  20
nxt2112       ok      20.0  21.1  21.0  97%  0.0  0   91712  370M  4.2G  41.6G  20
=========================================================================================================

$ lnu -l -j 753454
Job     User      Queue   Status  Node  Cpus
753454  ajochoa   long    R       1     20
HOST_NAME     status  r15s  r1m   r15m  ut   pg   ls  it     tmp   swp   mem    Assigned Cores
nxt1222 (L)   ok      4.3   4.5   6.2   20%  0.0  0   54464  422M  4.7G  52.9G  20
=========================================================================================================
The utilization (ut) and memory paging (pg) columns are, overall, probably the most significant. Note that tmp, swp, and mem refer to available amounts.
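When a job spans many nodes, scanning the ut column by eye gets tedious. The following is a minimal sketch that filters lnu -l output for nodes whose ut falls below a chosen percentage. It assumes the column layout shown in the examples above; low_util_nodes is a hypothetical helper name and the threshold of 50 is arbitrary.

```shell
#!/bin/bash
# Hypothetical helper: read `lnu -l -j jobid` output on stdin and print
# the host name and ut value of every node whose ut is below $1 percent.
# Fields are counted from the end of the line because rows marked "(L)"
# carry one extra field near the front.
low_util_nodes() {
    awk -v min="$1" '
        NF >= 13 && $(NF-7) ~ /^[0-9]+%$/ {
            ut = $(NF-7)
            sub(/%/, "", ut)                 # "20%" -> "20"
            if (ut + 0 < min + 0) print $1, ut "%"
        }'
}

# Intended use on the cluster (left commented out so that the sketch
# can be read and tried without lnu):
#   lnu -l -j 753454 | low_util_nodes 50
```

Fed the second example above, the filter would report nxt1222 at 20%; fed the first, it would print nothing, since all four nodes are at 95% or higher.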