
Difference between revisions of "Ada:Batch Job Submission"

From TAMU HPRC

Revision as of 17:34, 14 November 2014


Job Submission: the bsub command

Use the bsub command to submit a job script as shown below:

$ bsub < sample.job 
Job <138733> is submitted to default queue .

When submitted, the job is assigned a unique id. You may refer to the job using only the numerical portion of its id (e.g. 138733) with the various batch system commands.
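When a script needs the job id for later commands, the confirmation line shown above can be parsed. The following is a minimal sketch; extract_job_id is a hypothetical helper name, and it assumes the confirmation line always begins with "Job <id>" as in the sample output.

```shell
# Hypothetical helper: pull the numeric job id out of bsub's confirmation line.
extract_job_id() {
    # Print only the digits found between "Job <" and ">"; print nothing otherwise.
    printf '%s\n' "$1" | sed -n 's/^Job <\([0-9][0-9]*\)>.*/\1/p'
}

# Try it on the sample confirmation line (no LSF installation needed).
extract_job_id 'Job <138733> is submitted to default queue .'
```

In a real session the same helper could wrap the submission itself, for example jobid=$(extract_job_id "$(bsub < sample.job)").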

The batch facility on Ada is LSF (Load Sharing Facility) from IBM. To submit a job via LSF, a user should submit a job file which specifies submission options, commands to execute, and if needed, the batch queue to submit to.

A reminder: for the purpose of computation in batch mode, the Ada cluster has 837 nodes powered by the Ivy Bridge-EP processor and 15 powered by the Westmere-EX. The Ivy Bridge-EP nodes have 20 cpus/cores each, while the Westmere-EX nodes have 40. Compute nodes of either type come with different memory capacities. Note that a small portion of the memory on each compute node is used by the operating system and is not available to user processes. This is useful to bear in mind when constructing batch requests.

Submission Options

The resources needed for program execution are specified through submission options in the job file you submit.

Common Submission Options

Below are the common submission options. These options can be specified as #BSUB options in your job script (recommended). They can also be specified on the command line of the bsub command. The bsub man page describes the available submission options in more detail.

Option Description
-J jobname Name of the job. When used with the -j oe option, the job's output will be directed to a file named jobname.oXXXX where XXXX is the job id.
-L login_shell Shell for interpreting the job script. Recommended shell is /bin/bash.
-n X Number of cores (X) to be assigned to the job
-o output_file_name Specifies the output file name.
-P acctno Specifies the billing account to use for this job. Please consult the AMS documentation for more information.
-q queue_name Directs the submitted job to the queue_name queue. On Ada, use this option only to request the xlarge queue. Here are more details on queues on Ada.
-R "select[r]" Selects nodes with resource r.
-u email_address(es) The email addresses to send mail about the job. Using an external email address (e.g. @tamu.edu, @gmail.com, etc.) is recommended.
-x Specifies that LSF will not schedule any other jobs on the nodes engaged by this job. This is useful when, for performance reasons for example, sharing nodes with other jobs is undesirable. Usage per node will be assessed as 20 (or 40) * wall_clock_time.
-W HH:MM The limit on how long the job can run.
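To see what the -x rule above implies for a given request, the arithmetic can be sketched as follows. This is only an illustration: the helper names are hypothetical, the result is reported in core-minutes because the text does not state the time unit used for billing, and the [HH:]MM form of -W is taken from the sample job script on this page.

```shell
# Drop leading zeros so values such as "08" are not read as octal.
to_decimal() {
    local n
    n=$(printf '%s' "$1" | sed 's/^0*//')
    echo "${n:-0}"
}

# Convert a -W value ([HH:]MM) to minutes.
walltime_minutes() {
    local hours=0 minutes=$1
    case "$1" in
        *:*) hours=${1%%:*}; minutes=${1##*:} ;;
    esac
    echo $(( $(to_decimal "$hours") * 60 + $(to_decimal "$minutes") ))
}

# Usage for an exclusive (-x) job: nodes * cores per node * wall-clock minutes.
# Assumption: every core of each engaged node is charged, as the -x row states.
exclusive_core_minutes() {
    local nodes=$1 cores_per_node=$2 walltime=$3
    echo $(( nodes * cores_per_node * $(walltime_minutes "$walltime") ))
}

walltime_minutes 2:30              # prints 150
walltime_minutes 30                # prints 30
exclusive_core_minutes 2 20 2:30   # prints 6000
```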

Examples of Submission Options

To stress the importance of specifying resources correctly in BSUB directives and because this specification is a frequent source of error, we first present a number of examples.

#BSUB -n 40 -W 2:30
or, equivalently:
#BSUB -n 40
#BSUB -W 2:30
This directive will allocate 40 cpus. The duration (wall-clock time) of execution is specified to be a maximum of 2 hours and 30 minutes.

#BSUB -n 40 -M 20 -W 2:30
This directive will allocate 40 cpus and 20 MB of memory per process (cpu). The duration of execution is specified to be a maximum of 2 hours and 30 minutes.

#BSUB -n 40 -M 20 -W 2:30 -P 012345678
This directive will allocate 40 cpus and 20 MB of memory per process. The duration (wall-clock time) of execution is specified to be a maximum of 2 hours and 30 minutes. The number of billing units (BUs) used will be charged against the 012345678 project.

#BSUB -n 2 -W 2:30 -P 012345678 -R "select[gpu]"
This directive will allocate 2 cores on nodes with GPUs. The duration (wall-clock time) of execution is specified to be a maximum of 2 hours and 30 minutes. The number of billing units (BUs) used will be charged against the 012345678 project.

#BSUB -n 2 -W 2:30 -P 012345678 -R "select[phi]"
This directive will allocate 2 cores on nodes with PHIs. The duration (wall-clock time) of execution is specified to be a maximum of 2 hours and 30 minutes. The number of billing units (BUs) used will be charged against the 012345678 project.

#BSUB -n 40 -W 2:30 -P 012345678 -R "span[ptile=20]"
This directive will allocate 2 nodes with 20 cores on each node. The duration (wall-clock time) of execution is specified to be a maximum of 2 hours and 30 minutes. The number of billing units (BUs) used will be charged against the 012345678 project.

#BSUB -n 50 -W 2:30 -P 012345678 -R "span[ptile=20]"
This directive will allocate 3 nodes: two nodes with 20 cores each and a third node with 10 cores. The duration (wall-clock time) of execution is specified to be a maximum of 2 hours and 30 minutes. The number of billing units (BUs) used will be charged against the 012345678 project.

#BSUB -n 5 -W 2:30 -P 012345678 -R "span[ptile=5]" -R "select[mem1tb]" -q xlarge
This directive will allocate 5 cpus on a node with 1TB memory. The duration (wall-clock time) of execution is specified to be a maximum of 2 hours and 30 minutes. The job is submitted to the xlarge queue (for using extra-large memory nodes: 1 or 2 TB nodes). The number of billing units (BUs) used will be charged against the 012345678 project.
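The span[ptile=...] examples above follow a simple rule: nodes are filled with ptile cores each until the -n total is used up, and the last node takes the remainder. A minimal sketch of that arithmetic, using a hypothetical helper named ptile_layout (it models only the counts described in the examples, not the scheduler itself):

```shell
# Hypothetical helper: show how many cores land on each node for
# "-n TOTAL" combined with -R "span[ptile=PTILE]".
ptile_layout() {
    local total=$1 ptile=$2 layout=""
    # Guard against a zero or negative ptile, which would never terminate.
    [ "$ptile" -gt 0 ] || return 1
    while [ "$total" -gt 0 ]; do
        if [ "$total" -ge "$ptile" ]; then
            layout="$layout $ptile"
            total=$((total - ptile))
        else
            layout="$layout $total"
            total=0
        fi
    done
    # Strip the leading space before printing.
    echo "${layout# }"
}

ptile_layout 40 20   # prints: 20 20
ptile_layout 50 20   # prints: 20 20 10
ptile_layout 5 5     # prints: 5
```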

Requesting Specific Node Type

The BSUB "select" option specifies the type of node on which to run programs. The table below lists the resource types for the BSUB "select" option. This does not apply to remote visualization jobs.

Node Type Needed Job Parameter to Use
General 64GB N/A
256GB -R "select[mem256gb]"
1 TB -R "select[mem1tb]" -q xlarge
2 TB -R "select[mem2tb]" -q xlarge
PHI -R "select[phi]"
Any GPU -R "select[gpu]"
64GB GPU -R "select[gpu64gb]"
256GB GPU -R "select[gpu256gb]"

Controlling Locality

On Ada's NextScale nodes, you can improve performance by ensuring that the nodes selected for your job are "close" to each other. This helps to minimize latencies between nodes during communication. For the best explanation of this, see the section Control job locality using compute units in the Administering IBM Platform LSF manual, available in our local copy of the LSF documentation.

For Ada's NextScale nodes, we define a "Compute Unit" as all the nodes connected to a single Infiniband switch. Each compute unit has 24 nodes with 20 cores each, which means that you can run jobs of up to 480 cores with only one "hop" (switch) between any two nodes. Jobs using more than that number must use at least three hops between nodes on different switches (first to the source node's switch, then to the core switch, and finally to the destination node's switch). Even nodes in the same rack (each rack has three Infiniband switches) need three hops when they are attached to different switches.

If you are running multinode jobs and are concerned about either

  • consistency (e.g. for benchmarking), or,
  • maximum efficiency

you should consider making use of the settings for locality.

Be aware, however, that it may take longer before your job can be scheduled. If you ask for 24 nodes all on one switch, the scheduler will delay your job until that constraint can be met. If you ask for any 24 nodes, the scheduler may pick one node from each of 24 switches. Although the latter may run sooner, it will be much less efficient, since every node involved must pass through the core switch to talk to any other node.

For details on syntax, see the link above. In general, the following two settings may be the most useful:

Setting Result
-R "cu[pref=maxavail]" This will select nodes on the switches that are the least utilized. This helps to group nodes together and so to minimize interswitch communication. It won't be as efficient as the next setting, but should cut down the amount of time your job has to wait before starting.
-R "cu[maxcus=number]" This will guarantee that your job uses no more than number compute units. So, if number=1, you can use up to 480 cores and be sure of the most efficient communication pattern. With number=2, you can go up to 960 cores, in which case any one node can communicate with 23 nodes in only one hop and with the other 24 nodes in three hops.

Again, see the link above for details.

Note that you can also combine settings. For example,

-R "cu[pref=maxavail:maxcus=3]"

would assign your job to the three emptiest switches. The options and combinations are too numerous to document here. Just keep in mind that using compute units to minimize communication costs can have a significant impact.
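As a worked example of the figures in this section (24 nodes per compute unit, 20 cores per node, hence 480 cores per unit), the smallest maxcus value that can hold a job is the core count divided by 480, rounded up. The sketch below uses hypothetical helper names and simply prints an option string in the combined form shown above; whether that combination suits a particular job is for the user to decide.

```shell
# Values taken from the text: 24 nodes per compute unit, 20 cores per node.
NODES_PER_CU=24
CORES_PER_NODE=20

# Smallest number of compute units that can hold the requested cores.
min_compute_units() {
    local cores=$1 per_cu=$((NODES_PER_CU * CORES_PER_NODE))
    # Integer ceiling division.
    echo $(( (cores + per_cu - 1) / per_cu ))
}

# Build a locality option that caps the job at that many compute units.
locality_option() {
    echo "-R \"cu[pref=maxavail:maxcus=$(min_compute_units "$1")]\""
}

min_compute_units 480    # prints 1
min_compute_units 481    # prints 2
locality_option 960      # prints -R "cu[pref=maxavail:maxcus=2]"
```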

Batch Job files/Scripts

A batch request is expressed through a batch file: a text file, also called a job script, containing the appropriate directives and other specifications or commands. A batch file, say sample.job, consists of the LSF directives section (top part) and the (UNIX) commands section, in which you specify all the commands that need to be executed. All LSF directives start with the #BSUB string.

Structure of Job Files/Scripts

Here is the general layout of a common BSUB job file.

#BSUB directive(s)1
#BSUB directive(s)2
#BSUB ...

#UNIX commands section. From here on down "#" on col 1 starts a comment
#<-- at this point $HOME is the current working directory
cmd1
cmd2
...

The UNIX commands section is executed on a single node or on multiple nodes. Serial and OpenMP programs execute on only one node; MPI programs can use 1-848 nodes. The default current working directory is $HOME. If that is not a practical choice for you, you should explicitly change (cd) to the directory of your choice. Often a convenient choice is the directory you submit jobs from. LSF stores that directory's location in the LSB_SUBCWD environment variable. Also, by default, the executing UNIX shell is the bash shell.
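Changing to the submission directory can be done at the top of the commands section. The following is a minimal sketch; enter_submit_dir is a hypothetical helper name, and the fallback to $HOME when LSB_SUBCWD is unset (for example when the script is tried outside LSF) is an assumption added for illustration.

```shell
# Hypothetical helper: change to the directory the job was submitted from.
# Falls back to $HOME when LSB_SUBCWD is not set (e.g. outside LSF).
enter_submit_dir() {
    cd "${LSB_SUBCWD:-$HOME}" || return 1
}

# Stand-in value so the sketch can be tried outside a batch job;
# inside a job, LSF has already set LSB_SUBCWD.
LSB_SUBCWD=${LSB_SUBCWD:-/tmp}
enter_submit_dir && pwd
```

Returning a non-zero status when the cd fails lets the job script stop early instead of running its commands in the wrong directory.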

Next, we lay out an example of a complete job file. You submit the job script for execution using the bsub < jobfile command (shown at the top of this page). LSF then assigns it a job id, which you can use to track your job's progress through the system. The job's priority relative to other jobs will be determined based on several factors. This priority is used to order the jobs for consideration by the batch system for execution on the compute nodes.

Below is a sample job script for a serial job which requests only one node.

## job name
#BSUB -J matrix_serial_job

## send stderr and stdout to the same file
#BSUB -o matrix_out.%J

## login shell to avoid copying env from login session
## also helps the module function work in batch jobs
#BSUB -L /bin/bash

## 30 minutes of walltime ([HH:]MM)
#BSUB -W 30

## numprocs
#BSUB -n 1

## load intel toolchain
module load ictce

time ./matrix.exe

Here are more example job scripts.

Environment Variables

LSF provides some environment variables that can be accessed from a job script. See the bsub man page for the complete list of LSF environment variables. Below is a list of common environment variables.


Variable Description
$LSB_JOBID Contains the job id.
$LSB_JOB_CWD Contains the current working directory for job execution.
$LSB_SUBCWD Contains the directory from which the job was submitted.
$LSB_HOSTS Gives a list of compute nodes assigned to the job, one entry per MPI task. A useful command to display the contents of the list is echo $LSB_HOSTS. This environment variable is not available when the list of hostnames is more than 4096 bytes.
$LSB_MCPU_HOSTS Gives a string of compute nodes assigned to the job in a compact format. For example, "hostA 20 hostB 10". This should be used for large groups of nodes.
$LSB_DJOB_HOSTFILE Points to a file containing the hostnames in a format usable by MPI.
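The compact "host count host count" format of $LSB_MCPU_HOSTS is easy to unpack inside a job script. A minimal sketch, assuming the string has exactly the form shown in the table ("hostA 20 hostB 10"); mcpu_hosts_summary is a hypothetical helper name.

```shell
# Print "host cores" pairs one per line, followed by the total core count.
mcpu_hosts_summary() {
    local total=0
    # Intentionally unquoted: split the "host n host n" string into words.
    set -- $1
    while [ "$#" -ge 2 ]; do
        echo "$1 $2"
        total=$((total + $2))
        shift 2
    done
    echo "total $total"
}

# Try it on the example string from the table above.
mcpu_hosts_summary "hostA 20 hostB 10"
```

Inside a running job the call would be mcpu_hosts_summary "$LSB_MCPU_HOSTS".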

Notes

  • If you get an error like "DAT: library load failure: libdaplomcm.so.2: cannot open shared object file: No such file or directory" then try adding this to your job file:
export I_MPI_FABRICS='shm:ofa'

See Also