
LSF:Advanced:Locality


Physical Locality

Introduction

For a sense of why physical distance translates directly into latency, see: https://en.wikipedia.org/wiki/List_of_unusual_units_of_measurement#Light-nanosecond

The "physical locality" of compute nodes can have significant impact on runtime performance. 20 nodes that are "near" to each other in the cluster will usually have better performance than 20 nodes scattered across the cluster if your application uses a lot of multiprocess communication. This applies both to the Infiniband fabric used on Ada and the 10Gbps Ethernet fabric used on curie.

The InfiniBand (IB) fabric (network) that interconnects Ada's nodes consists of a central "core" switch, leaf switches, the cluster nodes, and the optical cables that connect them.

Up to 24 nodes attach directly to each leaf switch, so any two cores on different nodes attached to the same leaf switch can communicate across only one hop (switch). When the two communicating cores are on nodes attached to different leaf switches, the number of hops increases by at least two (e.g. node -> leaf -> core -> leaf -> node).

The manner in which LSF places a job across nodes on the IB fabric is determined by its own internal algorithm. Sometimes a user may want to "encourage" an arrangement different from the one LSF would choose, in order to take advantage of locality.

Job Placement in the IB Fabric on Ada

On Ada's NextScale nodes, you can improve performance by ensuring that the nodes selected for your job are "close" to each other. This helps to minimize latencies between nodes during communication.

For the best explanation of this, see the section "Control job locality using compute units" in the Administering IBM Platform LSF manual, available in our local copy of the LSF documentation. [HPRC - NEED TO LINK OUR LATEST ONLINE VERSION OF THIS] Another good reference is the man page for bsub (i.e. "man bsub").

For Ada's NextScale nodes, we define a "Compute Unit" as all the nodes connected to a single InfiniBand switch. There are 24 nodes, each with 20 cores, in each compute unit, which means that you can run jobs of up to 480 cores with only one "hop" (switch) between any two nodes. Jobs using more than that number of cores must use at least three hops between nodes on different switches (first to the source node's switch, then to the core switch, and finally to the destination node's switch). Even nodes in the same rack, but on different switches (each rack has three InfiniBand switches), must communicate over this longer path.
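
To make the arithmetic concrete, here is a minimal job-script sketch; the 480-core and 20-cores-per-node figures follow from the description above, and all other directives (queue, wall time, etc.) are omitted:

   #BSUB -n 480                 # 480 cores = 24 nodes x 20 cores, i.e. one full compute unit
   #BSUB -R "span[ptile=20]"    # place 20 tasks per node, so only whole nodes are used

Note that on its own this does not force the 24 nodes onto a single switch; the cu[] settings described below do that.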

You should consider making use of the settings for locality if you are running multinode jobs and are concerned about

  • consistency (e.g. for benchmarking), or,
  • maximum efficiency

Be aware, however, that it may take longer for your job to be scheduled. If you ask for 24 nodes all on one switch, the scheduler will delay your job until that constraint can be met. If you ask for any 24 nodes, the scheduler may pick nodes from 24 different switches spread all over the cluster. Although the latter may start sooner, it will run much less efficiently, since every node involved must go through the core switch to talk to any other node.

For details on syntax, see the link above. In general, the following two settings may be the most useful:


   -R "cu[pref=maxavail]"  

This will select nodes on the switches that are least utilized, which helps group nodes together and minimize inter-switch communication. It won't be as efficient as the next setting, but it should cut down the amount of time your job has to wait before starting.
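
As an illustration (a sketch only; the job name and core count are arbitrary placeholders), this setting might appear in a job file as:

   #BSUB -J locality_test           # job name (placeholder)
   #BSUB -n 120                     # 120 cores = 6 nodes at 20 cores per node
   #BSUB -R "span[ptile=20]"        # use whole nodes
   #BSUB -R "cu[pref=maxavail]"     # prefer the least-utilized compute units (switches)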

   -R "cu[maxcus=number]" 

This will guarantee that your job uses no more than number compute units. So, if number=1, you can use up to 480 cores and be sure of the most efficient communication pattern. With number=2, you can go up to 960 cores; any one node can then communicate with the 23 other nodes on its own switch in one hop and with the 24 nodes on the other switch in three hops.
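
For example, a 480-core job that must fit entirely on a single switch could be requested as follows (a sketch; other directives such as wall time and output file are omitted):

   #BSUB -n 480                     # one full compute unit: 24 nodes x 20 cores
   #BSUB -R "span[ptile=20]"        # whole nodes only
   #BSUB -R "cu[maxcus=1]"          # all nodes must share a single leaf switch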

Again, see the link above and "man bsub" for details.

Note that you can also combine settings. For example,

   -R "cu[pref=maxavail:maxcus=3]"

would assign your job to the three emptiest switches. There are too many possible options and combinations to document here; just keep in mind that using compute units to minimize communication costs can have a significant impact.
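
For completeness, the combined form can also be given directly on the bsub command line. The sketch below assumes a hypothetical MPI executable ./my_mpi_app launched with mpirun; the core count, wall time, and output file name are placeholders as well:

   bsub -n 240 -W 2:00 -o locality.%J.out \
        -R "span[ptile=20]" -R "cu[pref=maxavail:maxcus=3]" \
        mpirun ./my_mpi_app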