|
|
Line 1: |
Line 1: |
− | ==Job Submission: the bsub command== | + | == Job Submission == |
− | <pre>
| + | Once your job file is ready, submit it to LSF with the following command: |
− | bsub < jobfile # Submits specified job for processing by LSF | + | [ NetID@ada ~]$ '''bsub < ''MyJob.LSF''''' |
− | </pre> | + | Verifying job submission parameters... |
| + | Verifying project account... |
| + | Account to charge: 123456789123 |
| + | Balance (SUs): 4824.7811 |
| + | SUs to charge: 5.0000 |
| + | Job <12345> is submitted to default queue <sn_regular>. |
| | | |
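Because the monitoring and control commands all take the job ID, it can be handy to capture it at submission time. The following is a minimal sketch, assuming the confirmation line has the form shown in the output above; the variable names <code>submit_output</code> and <code>job_id</code> are illustrative, and the sample string stands in for a real call to bsub.

```shell
#!/bin/sh
# Sample confirmation line, copied from the bsub output shown above.
# On the cluster this would instead be captured with something like:
#   submit_output=$(bsub < MyJob.LSF)
submit_output='Job <12345> is submitted to default queue <sn_regular>.'

# Print only the digits inside the first pair of angle brackets on the
# line that starts with "Job <"; all other lines are ignored.
job_id=$(printf '%s\n' "$submit_output" | sed -n 's/^Job <\([0-9][0-9]*\)>.*/\1/p')

echo "Submitted job ID: $job_id"
```

The captured value can then be passed to the other commands, for example <code>bjobs "$job_id"</code> or <code>bkill "$job_id"</code>.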
− | Here is an illustration,
| + | After a job has been submitted, you may want to check on its progress or cancel it. Below is a list of the most commonly used job monitoring and control commands for jobs on Ada. |
| | | |
− | <pre> | + | {| class="wikitable" style="text-align: center;" |
− | $ bsub < sample1.job
| + | |+ Ada (LSF) Job Monitoring and Control Commands |
− | Verifying job submission parameters...
| + | ! style="width: 200pt;" | Function |
− | Job <224139> is submitted to default queue <devel>.
| + | ! style="width: 200pt;" | Command |
− | </pre>
| + | ! style="width: 150pt;" | Example |
| + | |- |
| + | |Submit a job |
| + | |bsub < [script_file] |
| + | |bsub < MyJob.LSF |
| + | |- |
| + | |Cancel/Kill a job |
| + | |bkill [job_id] |
| + | |bkill 101204 |
| + | |- |
| + | |Check summary status of a single job |
| + | |bjobs [job_id] |
| + | |bjobs 101204 |
| + | |- |
| + | |Check summary status of all <br> jobs for a user |
| + | |bjobs -u [user_name] |
| + | |bjobs -u adaUser1 |
| + | |- |
| + | |Check detailed status of a single job |
| + | |bjobs -l [job_id] |
| + | |bjobs -l 101204 |
| + | |- |
| + | |Modify job submission options |
| + | |bmod [bsub_options] [job_id] |
| + | |bmod -W 2:00 101204 |
| + | |} |
| | | |
− | The first thing LSF does upon submission is to tag your job with a numeric identifier, a job id.
| + | For more information on any of the commands above, please see their respective ''man'' pages. |
− | Above, that identifier is '''224139'''. You will need it in order to track or manage (kill or modify)
| + | [ NetID@ada ~]$ '''man bmod''' |
− | your jobs. Next, note that the default current working directory for the job is the directory
| |
− | you submitted the job from. If that's not what you need, you must explicitly indicate that, as we
| |
− | do above when we cd into a specific directory. On job completion, LSF will place in the submission
| |
− | directory the file stdout1.224139. It contains a log of job events and other data directed to
| |
− | standard out. Always inspect this file for useful information.<br>
| |
− | | |
− | By default, a job executes under the environment of the submitting process. This you can change
| |
− | by using the '''-L shell''' option (see below) and/or by specifying at the start of the job script
| |
− | the shell that will execute it. For example, if you want the job to execute under the C-shell, the first
| |
− | command above the #BSUB directives should be #!/bin/csh.
| |
− | | |
− | | |
− | ===Job tracking and control commands===
| |
− | | |
− | <pre>
| |
− | bjobs [-u all or user_name] [[-l] job_id] # displays job information per user(s) or job_id, in summary or detail (-l) form, respectively.
| |
− | bpeek [-f] job_id # displays the current contents of stdout and stderr output of an executing job.
| |
− | bkill job_id # kills, suspends, or resumes unfinished jobs. See man bkill for details.
| |
− | bmod [bsub_options] job_id # Modifies job submission options of a job. See man bmod for details.
| |
− | lsload [node_name] # Lists on std out a node's utilization. Use bjobs -l jobid
| |
− | # to get the names of nodes associated with a jobid. See man lsload for details.
| |
− | </pre>
| |
− | | |
− | All of the above have decent man pages, if you're interested in more detail.
| |
− | | |
− | '''Examples''' | |
− | <pre>
| |
− | $ bjobs -u all
| |
− | JOBID STAT USER QUEUE JOB_NAME NEXEC_HOST SLOTS RUN_TIME TIME_LEFT
| |
− | 223537 RUN adinar long NOR_Q 1 20 400404 second(s) 8:46 L
| |
− | 223547 RUN adinar long NOR_Q 1 20 399830 second(s) 8:56 L
| |
− | 223182 RUN tengxj1025 long pro_at16_lowc 10 280 325922 second(s) 5:27 L
| |
− | 229307 RUN natalieg long LES_MORE 3 900 225972 second(s) 25:13 L
| |
− | 229309 RUN tengxj1025 long pro_atat_lowc 7 280 223276 second(s) 33:58 L
| |
− | 229310 RUN tengxj1025 long cg16_lowc 5 280 223228 second(s) 33:59 L
| |
− | . . . . . . . . .
| |
− | | |
− | $ bjobs -l 229309
| |
− | | |
− | Job <229309>, Job Name <pro_atat_lowc>, User <tengxj1025>, Project <default>, M
| |
− | ail <czjnbb@gmail.com>, Status <RUN>, Queue <long>, J
| |
− | ob Priority <250000>, Command <## job name;#BSUB -J p
| |
− | ro_atat_lowc; ## send stderr and stdout to the same f
| |
− | ile ;#BSUB -o info.%J; ## login shell to avoid copyin
| |
− | g env from login session;## also helps the module fun
| |
− | ction work in batch jobs;#BSUB -L /bin/bash; ## 30 mi
| |
− | nutes of walltime ([HH:]MM);#BSUB -W 96:00; ## numpro
| |
− | cs;#BSUB -n 280; . . .
| |
− | . . .
| |
− | | |
− | RUNLIMIT
| |
− | 5760.0 min of nxt1449
| |
− | Tue Nov 4 21:34:43 2014: Started on 280 Hosts/Processors <nxt1449> <nxt1449> <
| |
− | nxt1449> <nxt1449> <nxt1449> <nxt1449> ...
| |
− | . . .
| |
− | | |
− | Execution
| |
− | CWD </scratch/user/tengxj1025/EXTD/pro_atat/lowc/md>;
| |
− | Fri Nov 7 12:05:55 2014: Resource usage collected.
| |
− | The CPU time used is 67536997 seconds.
| |
− | MEM: 44.4 Gbytes; SWAP: 0 Mbytes; NTHREAD: 862
| |
− | | |
− | HOST: nxt1449
| |
− | MEM: 3.2 Gbytes; SWAP: 0 Mbytes; CPU_TIME: 9004415 s
| |
− | econds . . .
| |
− | . . .
| |
− | . . .
| |
− | | |
− | | |
− | $ bmod -W 46:00 229309 # resets wall-clock time to 46 hrs for job 229309
| |
− | | |
− | | |
− | </pre>
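The summary listing can also be post-processed with standard tools. The following is a rough sketch, assuming the column order shown in the bjobs -u all example above (JOBID, STAT, USER, ...); the sample rows are copied from that listing and abridged to the first five columns, and the variable names are illustrative. The column layout may differ on other LSF installations.

```shell
#!/bin/sh
# Sample rows copied from the bjobs -u all listing above
# (only the first five columns are kept here).
summary='JOBID STAT USER QUEUE JOB_NAME
223537 RUN adinar long NOR_Q
223182 RUN tengxj1025 long pro_at16_lowc
229309 RUN tengxj1025 long pro_atat_lowc'

# Count running jobs per user: skip the header row, keep rows whose
# STAT column (field 2) is RUN, and tally by the USER column (field 3).
running_per_user=$(printf '%s\n' "$summary" |
  awk 'NR > 1 && $2 == "RUN" { n[$3]++ } END { for (u in n) print u, n[u] }' |
  sort)

echo "$running_per_user"
```

On the cluster, the sample text would be replaced by the live output, for example <code>summary=$(bjobs -u all)</code>.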
| |
− | | |
− | | |
− | '''Node Utilization.''' It may happen that a job uses its allocated nodes inefficiently.
| |
− | Sometimes this is unavoidable, but often it can be avoided. It is unavoidable, for instance, if the
| |
− | amount of memory used per node is a large fraction of the total for that node, and only 1 cpu is used. In
| |
− | that case, cpu utilization will be at best at 5% (1/20) in a regular node. A handy tool, more practical than lsload,
| |
− | for tracking node utilization is the '''lnu''' homegrown command.
| |
− | | |
− | <pre>
| |
− | lnu [-h] [-l] -j jobid # lists on stdout the utilization across all nodes for an executing job. See examples below.
| |
− | </pre>
| |
− | | |
− | '''Examples'''
| |
− | | |
− | <pre>
| |
− | $ lnu -l -j 795375
| |
− | Job User Queue Status Node Cpus
| |
− | 795375 jomber23 medium R 4 80
| |
− | HOST_NAME status r15s r1m r15m ut pg ls it tmp swp mem Assigned Cores
| |
− | nxt1417 ok 20.0 21.0 21.0 97% 0.0 0 94976 366M 3.7G 41.6G 20
| |
− | nxt1764 (L) ok 19.7 20.0 20.0 95% 0.0 0 95040 366M 3.7G 41.5G 20
| |
− | nxt2111 ok 20.0 20.0 20.0 98% 0.0 0 91712 370M 4.2G 41.5G 20
| |
− | nxt2112 ok 20.0 21.1 21.0 97% 0.0 0 91712 370M 4.2G 41.6G 20
| |
− | =========================================================================================================
| |
− | | |
− | $ lnu -l -j 753454 | |
− | Job User Queue Status Node Cpus
| |
− | 753454 ajochoa long R 1 20
| |
− | HOST_NAME status r15s r1m r15m ut pg ls it tmp swp mem Assigned Cores
| |
− | nxt1222 (L) ok 4.3 4.5 6.2 20% 0.0 0 54464 422M 4.7G 52.9G 20
| |
− | =========================================================================================================
| |
− | | |
− | </pre>
| |
− | | |
− | The utilization ('''ut''') and memory paging ('''pg'''), overall, are probably the most significant. Note that the
| |
− | '''tmp''', '''swp''', and '''mem''' columns report ''available'' amounts, not amounts in use.
| |
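The utilization arithmetic can be checked by hand. The sketch below, with illustrative variable names, reproduces the 1/20 figure for a single busy core on a 20-core node and then estimates how busy the assigned cores are from the one-minute load average (r1m) in the nxt1222 row of the lnu listing above. Dividing the load average by the assigned cores is only a rough indicator and is an assumption of this sketch, not how lnu computes the '''ut''' column.

```shell
#!/bin/sh
# One busy core on a 20-core node, as in the text: 1/20 = 5%.
single_core_pct=$((100 / 20))

# Row copied from the lnu listing above for host nxt1222.
row='nxt1222 (L) ok 4.3 4.5 6.2 20% 0.0 0 54464 422M 4.7G 52.9G 20'

# The optional "(L)" marker shifts the leading fields, so count from the end:
# the last field is Assigned Cores and r1m sits 9 fields before it.
load_pct=$(printf '%s\n' "$row" | awk '{ printf "%.1f", ($(NF-9) * 100) / $NF }')

echo "single core: ${single_core_pct}%  load per assigned core: ${load_pct}%"
```

For this row the estimate is close to the 20% reported in the '''ut''' column, which suggests that most of the 20 assigned cores are idle.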