== Job Submission: the bsub command ==

Once you have your job file ready (e.g., ''MyJob.LSF''), submit it to LSF for processing with the '''bsub''' command:

<pre>
[NetID@ada ~]$ bsub < MyJob.LSF      # submits the specified job file for processing by LSF
</pre>
Here is an illustration. In addition to verifying the submission parameters, bsub reports the project account that will be charged and its balance of service units (SUs):

<pre>
[userx@login4]$ bsub < sample1.job

Verifying job submission parameters...
Verifying project account...
     Account to charge:   123456789123
         Balance (SUs):      5000.0000
         SUs to charge:         5.0000

Job <224139> is submitted to default queue <devel>.

[userx@login4]$
</pre>
The first thing LSF does upon submission is to tag your job with a numeric identifier, a job id. Above, that identifier is '''224139'''. You will need it in order to track or manage (kill or modify) your job. Next, note that the default current working directory for the job is the directory you submitted the job from. If that is not what you need, you must change it explicitly in the job script, for example by cd-ing into the desired directory. On job completion, LSF will place in the submission directory the file stdout1.224139, which contains a log of job events and other data directed to standard output. Always inspect this file for useful information.<br>
By default, a job executes under the environment of the submitting process. You can change this by using the '''-L shell''' option (see below) and/or by specifying at the start of the job script the shell that will execute it. For example, if you want the job to execute under the C shell, the first command after the #BSUB directives should be #!/bin/csh.
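To make the overall structure of a job file concrete, here is a minimal sketch of one that runs under bash; the job name, the resource values, and the program it runs (''my_program'') are illustrative only:

<pre>
#BSUB -J MySerialJob            # job name
#BSUB -L /bin/bash              # initialize the job environment with a new bash login shell
#BSUB -W 30                     # wall-clock limit of 30 minutes
#BSUB -n 1                      # one core/job slot
#BSUB -R "span[ptile=1]"        # one core per node
#BSUB -R "rusage[mem=2500]"     # schedule on a node with at least 2500 MB of memory available
#BSUB -M 2500                   # enforce a 2500 MB per-process memory limit
#BSUB -o MySerialJob.%J         # standard output file; %J expands to the job id

# Commands to execute start here
cd $SCRATCH/my_run_dir
./my_program input.dat > output.txt
</pre>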
'''Five important job parameters:'''
<pre>
#BSUB -n NNN                  # NNN: total number of cpus/job slots to allocate for the job
#BSUB -R "span[ptile=XX]"     # XX: number of cores/cpus per node to use
#BSUB -R "select[node-type]"  # node-type: nxt, mem256gb, gpu, phi, mem1t, mem2t ...
#BSUB -R "rusage[mem=nnn]"    # selects nodes that each have at least XX * nnn MB of memory available
#BSUB -M nnn                  # sets the per-process enforceable memory limit to nnn MB
</pre>
We list these together because in many jobs they are closely related and, therefore, must be set consistently. We recommend their adoption in all jobs: serial, single-node, and multi-node. Note that without the '''-R "rusage[mem=nnn]"''' specification, LSF may select nodes that do not have the nnn MB of memory available. The '''-M nnn''' option, on the other hand, limits the amount of memory allocated to a process; when this limit is violated, the job will abort. Omitting this specification causes LSF to assume the default memory limit of 2.5 gigabytes (2500 MB) per process. The following examples, with some commentary, illustrate the use of these options.
<pre>
#BSUB -n 900                  # 900: total number of cpus to allocate for the job
#BSUB -R "span[ptile=20]"     # 20: number of cores/cpus per node to use
#BSUB -R "select[nxt]"        # allocates NeXtScale nodes
</pre>
The above specifications will allocate 45 (= 900/20) whole nodes. In many parallel jobs the selection of NeXtScale nodes at 20 cores per node is the best choice. Here we are simply illustrating what happens when you omit the memory-related options; we definitely urge you to specify them. The enforceable memory limit per process is then the default, 2500 MB (2.5 GB).
<pre>
#BSUB -n 900                  # 900: total number of cpus to allocate for the job
#BSUB -R "span[ptile=16]"     # 16: number of cores/cpus per node to use
#BSUB -R "select[nxt]"        # allocates exclusively whole NeXtScale nodes
#BSUB -R "rusage[mem=3600]"   # schedules on nodes that have at least 16 * 3600 = 57,600 MB available
#BSUB -M 3600                 # limits (and enforces) memory use to 3600 MB per process, i.e. 57,600 MB per node
</pre>
The above specifications will allocate 57 (= ceiling(900/16)) nodes. The decision to use only XX (here 16) cores per node, rather than the maximum of 20, requires some judgement; the execution profile of the job is important. Typically, some experimentation is required to find the optimal ptile value for a given code.
<pre>
#BSUB -n 1                    # allocate a total of 1 cpu/core for the job, appropriate for serial processing
#BSUB -R "span[ptile=1]"      # allocate 1 cpu per node
#BSUB -R "select[gpu]"        # the allocated node must have GPUs (64 GB or 256 GB of memory). A "select[phi]"
                              # specification would allocate a node with Phi coprocessors.
</pre>
Omitting the last two options in the above will cause LSF to place the job on any conveniently available core on any node, idle or (partially) busy, of any type, except those with 1 TB or 2 TB of memory.<br>
It is worth emphasizing that, under the current LSF setup, only the '''-x''' option or a ptile value equal to the node's core count will prevent LSF from scheduling other jobs on the balance of a node's unreserved cores.
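For example, either of the following (illustrative) settings keeps other jobs off the nodes allocated to a 40-core job on 20-core NeXtScale nodes:

<pre>
#BSUB -n 40
#BSUB -R "span[ptile=20]"     # fills all 20 cores of each node, leaving no unreserved cores

## or, alternatively:

#BSUB -n 40
#BSUB -R "span[ptile=10]"
#BSUB -x                      # requests exclusive use of the allocated nodes; SUs are charged for all cores
</pre>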
'''Inhomogeneous Node Selection'''
<pre>
#BSUB -n 900                  # allocate a total of 900 cores/job slots to the job
#BSUB -R "600*{ select[nxt] rusage[mem=2500] span[ptile=20] } + 300*{ select[gpu] rusage[mem=12000] span[ptile=20] }"
#BSUB -M 12000                # sets the per-process enforceable memory limit to 12000 MB
</pre>
The above specification will allocate 30 NeXtScale nodes and 15 iDataPlex nodes, the latter with GPUs, at 20 cores per node. Note that the enforceable memory limit here is 12 GB per process.
====Common BSUB Options====
<pre>
-J job_name       - sets the job name.
-L shell          - uses the specified Unix shell to initialize the job's execution environment. We strongly recommend
                    that this be set to /bin/bash. If not specified, the job inherits the environment of the submitting process.
-W hh:mm or -W mm - sets the job's runtime wall-clock limit in hours:minutes or just minutes.
-M mem_limit      - sets the per-process memory limit in megabytes (MB). The job's memory limit is then num_cores * mem_limit.
                    This limit is enforced; that is, when violated the job aborts.
-n num_cores      - assigns the number of job slots/cores.
-x                - assigns whole node(s) exclusively to the job. The SUs charged reflect use of all the cores in a node.
-o filename       - directs the job's standard output to filename. The special string %J attaches the job id.
-P project_name   - charges the consumed service units (SUs) to the project specified.
-u e-mail_addr    - sends email to the specified address (e.g., netid@tamu.edu, myname@gmail.com) with information about main job events.
</pre>
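As a sketch of how several of these options are typically combined in a job file's directive block (the job name, project number, and e-mail address below are placeholders):

<pre>
#BSUB -J MyParallelJob          # job name
#BSUB -L /bin/bash              # initialize the job environment with a new bash login shell
#BSUB -W 2:30                   # wall-clock limit of 2 hours and 30 minutes
#BSUB -n 40                     # 40 job slots/cores
#BSUB -R "span[ptile=20]"       # 20 cores per node, i.e. 2 whole nodes
#BSUB -M 2500                   # per-process memory limit of 2500 MB
#BSUB -x                        # use the allocated nodes exclusively
#BSUB -o MyParallelJob.%J       # standard output file; %J expands to the job id
#BSUB -P 123456789123           # project/account to charge (placeholder number)
#BSUB -u netid@tamu.edu         # e-mail notifications about main job events
</pre>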
====More Examples====
In the following four job scripts, we illustrate four different ways of executing an application program, ABAQUS, to solve the same engineering problem, specified in the '''s4b.inp''' input file. The latter can be copied from the "Examples" database of ABAQUS by using the '''fetch''' option. Keep in mind, please, that not all problems specified via ABAQUS are amenable to effective parallelization of every type.<br>

It is very important when running packaged code that the resource parameters (e.g., cpus, memory, gpu) you specify via BSUB directives are in agreement with their counterparts on the application's command line. It turns out that the engineering problem described in s4b.inp shows remarkable improvement in performance as we move through the different modes of execution: serial, GPU, OpenMP, and finally MPI.<br>
'''Example 1 (Serial)'''

<pre>
#BSUB -J s4b_serial -o s4b_serial.%J -W 400 -L /bin/bash -n 1 -R "span[ptile=1] rusage[mem=42000]" -M 42000 -R 'select[nxt]'
## 1 * 42,000 MB = 42 GB mem_limit
#
mkdir $SCRATCH/abaqus; cd $SCRATCH/abaqus
#
module load ictce
module load ABAQUS
#
abaqus fetch job=s4b.inp
##
# The default number of cores/cpus is, as per ABAQUS, equal to 1. Hence, the "cpus=" option is omitted below.
#
abaqus analysis job=s4b_serial input=s4b.inp memory="42 gb" double scratch=$SCRATCH/abaqus
</pre>
'''Example 2 (OpenMP)'''

<pre>
#BSUB -J s4b_smp -o s4b_smp.%J -L /bin/bash -W 40 -n 20 -R "span[ptile=20] rusage[mem=2000]" -M 2000 -R "select[nxt]"
## 20 * 2,000 MB = 40,000 MB = 40 GB total
## OpenMP/multi-threaded run on 20 cores
#
mkdir $SCRATCH/abaqus
cd $SCRATCH/abaqus
#
module load ictce
module load ABAQUS
#
abaqus fetch job=s4b.inp
##
# The mp_mode=threads setting signifies the deployment of the OpenMP parallelization model.
#
abaqus analysis job=s4b_smp input=s4b.inp mp_mode=threads cpus=20 memory="40 gb" double scratch=$SCRATCH/abaqus
#
</pre>
'''Example 3 (MPI)'''

<pre>
#BSUB -J s4b_mpi64 -o s4b_mpi64.%J -L /bin/bash -W 200 -n 64 -R 'span[ptile=16] rusage[mem=2500]' -M 2500 -x
##
## Runs a 64-way MPI job, 16 cores per node, across 4 nodes. Total memory limit: 64 * 2,500 MB = 160,000 MB = 160 GB
#
mkdir $SCRATCH/abaqus
cd $SCRATCH/abaqus
#
module load ictce
module load ABAQUS
#
abaqus fetch job=s4b.inp
#
abaqus analysis job=s4b_mpi64 input=./s4b.inp mp_mode=mpi cpus=64 memory="150 gb" double scratch=$SCRATCH/abaqus
#
</pre>
'''Example 4 (GPU)'''

<pre>
#BSUB -J s4b_gpu -o s4b_gpu.%J -L /bin/bash -W 40 -n 1 -R 'span[ptile=1] rusage[mem=160000]' -M 160000 -R 'select[gpu256gb]'
## 1 * 160,000 MB = 160 GB
mkdir $SCRATCH/abaqus
cd $SCRATCH/abaqus
#
module load ictce
module load ABAQUS
#
abaqus fetch job=s4b.inp
##
abaqus analysis job=s4b_gpu input=s4b.inp gpus=1 memory="160 gb" double scratch=$SCRATCH/abaqus
#
</pre>
====Environment Variables====

When LSF selects and activates a node to run your job, by default it duplicates the environment of the process the job was submitted from. In the course of your work that environment may have been altered (e.g., by loading modules or by setting or changing environment variables) so that it differs from the one created at login. The next job you submit, however, may require a different execution environment. Hence the recommendation that, when submitting jobs, you request the creation of a new login shell and explicitly customize the environment within the job as needed. A new login shell per job is initialized by specifying the '''#BSUB -L /bin/bash''' option.<br>

All the nodes enlisted for the execution of a job carry most of the environment variables created by the login process: HOME, PWD, PATH, USER, etc. In addition, LSF defines new variables in the environment of an executing job. Below is an abbreviated list.
<pre>
LSB_QUEUE:      The name of the queue the job is dispatched from.
LSB_JOBNAME:    Name of the job.
LSB_JOBID:      Batch job ID assigned by LSF.
LSB_ERRORFILE:  Name of the error file specified with bsub -e.
LSB_HOSTS:      The list of nodes (their LSF symbolic names) used to run the batch job. A node name is repeated
                as many times as needed to equal the specified ptile value. The size of the LSB_HOSTS variable is limited to 4096 bytes.
LSB_MCPU_HOSTS: The list of nodes (their LSF symbolic names) and the specified or default ptile value per node used to run the batch job.
                This variable can be relied upon to contain the names of all the deployed hosts.
LS_SUBCWD:      The directory the job was submitted from.
</pre>
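As a quick, illustrative use of these variables, a job script can record where and under which job id it is running:

<pre>
echo "Job $LSB_JOBID ($LSB_JOBNAME) dispatched from queue $LSB_QUEUE"
echo "Submitted from directory: $LS_SUBCWD"
echo "Hosts and slots allocated: $LSB_MCPU_HOSTS"
cd $LS_SUBCWD        # return to the submission directory, e.g. after working in $SCRATCH
</pre>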
== tamubatch ==

'''tamubatch''' is an automatic batch job script that submits jobs for the user without the need of writing a batch script on the Ada and Terra clusters. The user just needs to provide the executable commands in a text file and tamubatch will automatically submit the job to the cluster. There are flags that the user may specify which allow control over the parameters of the submitted job.

''tamubatch is still in beta and has not been fully developed. Although there are still bugs and testing issues that are currently being worked on, tamubatch can already submit jobs to both the Ada and Terra clusters if given a file of executable commands.''

For more information, visit [https://hprc.tamu.edu/wiki/SW:tamubatch this page.]
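As a minimal sketch of the workflow (the commands file name and its contents are illustrative; see the page linked above for the available flags), you place one executable command per line in a text file and pass that file to tamubatch:

<pre>
[NetID@ada ~]$ cat commands.txt
./my_program input1.dat > output1.txt
./my_program input2.dat > output2.txt
./my_program input3.dat > output3.txt

[NetID@ada ~]$ tamubatch commands.txt
</pre>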
== tamulauncher ==

'''tamulauncher''' provides a convenient way to run a large number of serial or multithreaded commands without the need to submit individual jobs or a job array. The user provides a text file containing all commands that need to be executed and tamulauncher will execute the commands concurrently. The number of concurrently executed commands depends on the batch requirements. When tamulauncher is run interactively, the number of concurrently executed commands is limited to at most 8. tamulauncher is available on terra, ada, and curie. There is no need to load any module before using tamulauncher. tamulauncher has been successfully tested to execute over 100K commands.

''tamulauncher is preferred over Job Arrays for submitting a large number of individual jobs, especially when the run times of the commands are relatively short. It allows for better utilization of the nodes, puts less burden on the batch scheduler, and lessens interference with the jobs of other users on the same node.''

For more information, visit [https://hprc.tamu.edu/wiki/SW:tamulauncher#tamulauncher this page.]
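A sketch of how tamulauncher is typically used inside a batch job (the resource values and the commands file name are illustrative; see the page linked above for details): the job requests a pool of cores, and tamulauncher spreads the commands in the file over those cores.

<pre>
#BSUB -J tamulauncher_demo      # job name
#BSUB -L /bin/bash              # initialize the job environment with a new bash login shell
#BSUB -W 2:00                   # wall-clock limit of 2 hours
#BSUB -n 40                     # 40 cores in total
#BSUB -R "span[ptile=20]"       # 20 cores per node
#BSUB -R "rusage[mem=2500]"     # schedule on nodes with at least 20 * 2500 MB available
#BSUB -M 2500                   # per-process memory limit of 2500 MB
#BSUB -o tamulauncher_demo.%J   # standard output file

tamulauncher commands.txt       # executes the commands listed in commands.txt concurrently on the allocated cores
</pre>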
==Job tracking and control commands==

<pre>
bjobs [-u all or user_name] [[-l] job_id]   # displays job information per user(s) or job_id, in summary or detail (-l) form.
bpeek [-f] job_id                           # displays the current contents of the stdout and stderr output of an executing job.
bkill job_id                                # kills, suspends, or resumes unfinished jobs. See man bkill for details.
bmod [bsub_options] job_id                  # modifies the job submission options of a job. See man bmod for details.
lsload [node_name]                          # lists a node's utilization on stdout. Use bjobs -l jobid
                                            # to get the names of the nodes associated with a job id. See man lsload for details.
</pre>
'''Examples'''
<pre>
$ bjobs -u all
JOBID      STAT  USER         QUEUE    JOB_NAME        NEXEC_HOST  SLOTS  RUN_TIME           TIME_LEFT
223537     RUN   adinar       long     NOR_Q           1           20     400404 second(s)   8:46 L
223547     RUN   adinar       long     NOR_Q           1           20     399830 second(s)   8:56 L
223182     RUN   tengxj1025   long     pro_at16_lowc   10          280    325922 second(s)   5:27 L
229307     RUN   natalieg     long     LES_MORE        3           900    225972 second(s)   25:13 L
229309     RUN   tengxj1025   long     pro_atat_lowc   7           280    223276 second(s)   33:58 L
229310     RUN   tengxj1025   long     cg16_lowc       5           280    223228 second(s)   33:59 L
. . .

$ bjobs -l 229309

Job <229309>, Job Name <pro_atat_lowc>, User <tengxj1025>, Project <default>, M
                     ail <czjnbb@gmail.com>, Status <RUN>, Queue <long>, J
                     ob Priority <250000>, Command <## job name;#BSUB -J p
                     ro_atat_lowc; ## send stderr and stdout to the same f
                     ile ;#BSUB -o info.%J; ## login shell to avoid copyin
                     g env from login session;## also helps the module fun
                     ction work in batch jobs;#BSUB -L /bin/bash; ## 30 mi
                     nutes of walltime ([HH:]MM);#BSUB -W 96:00; ## numpro
                     cs;#BSUB -n 280; . . .
. . .

 RUNLIMIT
 5760.0 min of nxt1449
Tue Nov  4 21:34:43 2014: Started on 280 Hosts/Processors <nxt1449> <nxt1449> <
                     nxt1449> <nxt1449> <nxt1449> <nxt1449> ...
. . .

 Execution
 CWD </scratch/user/tengxj1025/EXTD/pro_atat/lowc/md>;
Fri Nov  7 12:05:55 2014: Resource usage collected.
                     The CPU time used is 67536997 seconds.
                     MEM: 44.4 Gbytes; SWAP: 0 Mbytes; NTHREAD: 862

                     HOST: nxt1449
                     MEM: 3.2 Gbytes; SWAP: 0 Mbytes; CPU_TIME: 9004415 s
                     econds . . .
. . .


$ bmod -W 46:00 229309      # resets the wall-clock time limit to 46 hours for job 229309
</pre>
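For instance, to check on and then terminate the job from the listing above (the job id is illustrative):

<pre>
$ bpeek -f 229309      # follow the job's current stdout/stderr output
$ bkill 229309         # kill the job when it is no longer needed
</pre>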
'''Node Utilization.''' It may happen that a job uses its allocated nodes inefficiently. Sometimes this is unavoidable, but many times it is very avoidable. It is unavoidable, for instance, if the amount of memory used per node is a large fraction of the total for that node while only 1 cpu is used. In that case, cpu utilization will be at best 5% (1/20) on a regular node. A handy tool for tracking node utilization, more practical than lsload, is the homegrown '''lnu''' command.
<pre>
lnu [-h] [-l] -j jobid      # lists on stdout the utilization across all nodes of an executing job. See the examples below.
</pre>
'''Examples'''

<pre>
$ lnu -l -j 795375
  Job          User      Queue     Status   Node  Cpus
  795375       jomber23  medium    R        4     80
HOST_NAME    status  r15s   r1m  r15m   ut    pg   ls    it    tmp   swp    mem  Assigned Cores
nxt1417          ok  20.0  21.0  21.0  97%   0.0    0  94976   366M  3.7G  41.6G      20
nxt1764 (L)      ok  19.7  20.0  20.0  95%   0.0    0  95040   366M  3.7G  41.5G      20
nxt2111          ok  20.0  20.0  20.0  98%   0.0    0  91712   370M  4.2G  41.5G      20
nxt2112          ok  20.0  21.1  21.0  97%   0.0    0  91712   370M  4.2G  41.6G      20
=========================================================================================================

$ lnu -l -j 753454
  Job          User      Queue     Status   Node  Cpus
  753454       ajochoa   long      R        1     20
HOST_NAME    status  r15s   r1m  r15m   ut    pg   ls    it    tmp   swp    mem  Assigned Cores
nxt1222 (L)      ok   4.3   4.5   6.2  20%   0.0    0  54464   422M  4.7G  52.9G      20
=========================================================================================================
</pre>
The utilization ('''ut''') and memory paging ('''pg''') columns are, overall, probably the most significant. Note that '''tmp''', '''swp''', and '''mem''' refer to ''available'' amounts, respectively.

[[Category:Ada]]