Locality
The "physical locality" of compute nodes can have a significant impact on runtime performance. Twenty nodes that are "near" to each other in the cluster will usually perform better than twenty nodes scattered across the cluster if your application does a lot of multiprocess communication. This applies both to the InfiniBand fabric used on Ada and to the 10Gbps Ethernet fabric used on curie.
The InfiniBand (IB) fabric that interconnects Ada's nodes consists of a central "core" switch, leaf switches, and the optical cables that tie the switches and nodes together.
Up to 24 nodes attach directly to a leaf switch. So, any two cores on different nodes attached to the same leaf switch can communicate by crossing only one hop (switch). When the two communicating cores are on nodes attached to different leaf switches, the number of hops is at least three (node -> leaf -> core -> leaf -> node).
LSF places a job across nodes on the IB fabric according to its own internal algorithm. Sometimes a user may want to "encourage" an arrangement different from the one LSF would choose.
Job Placement in the IB Fabric on Ada
On Ada's NextScale nodes, you can improve performance by ensuring that the nodes selected for your job are "close" to each other. This helps to minimize latencies between nodes during communication.
For the best explanation of this, see the section "Control job locality using compute units" in the Administering IBM Platform LSF manual, available in our local copy of the LSF documentation. [HRPC - need link for our latest online copy of this]
For Ada's NextScale nodes, we define a "compute unit" as all the nodes connected to a single InfiniBand leaf switch. Each compute unit contains 24 nodes with 20 cores each, which means that you can run jobs of up to 480 cores with only one hop (switch) between any two nodes. Jobs using more cores than that must use at least three hops between nodes on different switches (first to the source node's leaf switch, then to the core switch, and finally to the destination node's leaf switch). Even nodes in the same rack (each rack has three InfiniBand switches) have to travel this distance.
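As an illustration, a minimal multi-node job script along the following lines (the job name, walltime limit, output file, and MPI launch line are placeholders, not site defaults) would request 480 cores packed 20 to a node, i.e. 24 NextScale nodes. By itself, it says nothing about which switches those nodes come from:
# hypothetical 480-core job: 20 cores per node -> 24 nodes
#BSUB -J locality_example
#BSUB -n 480
#BSUB -R "span[ptile=20]"
#BSUB -W 2:00
#BSUB -o out.%J
mpirun ./my_mpi_app
Submit it with bsub < jobscript. The cu[] settings described below control how those 24 nodes are placed relative to the InfiniBand leaf switches.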
If you are running multi-node jobs and are concerned about either
- consistency (e.g., for benchmarking), or
- maximum efficiency,
you should consider making use of the settings for locality.
Be aware, however, that it may take longer before your job can be scheduled. If you ask for 24 nodes all on one switch, the scheduler will delay your job until that constraint can be met. If you ask for any 24 nodes, the scheduler may pick one node from each of 24 switches. Although the latter job may start sooner, it will run much less efficiently, since every node involved must pass through the core switch to talk to any other node.
For details on syntax, see the link above. In general, the following two settings may be the most useful (example submissions are sketched below):
Setting: -R "cu[pref=maxavail]"
Result: This selects nodes on the least-utilized switches, which helps group nodes together and minimize inter-switch communication. It won't be as efficient as the next setting, but it should cut down on the amount of time your job waits before starting.
Setting: -R "cu[maxcus=number]"
Result: This guarantees that your job will use no more than number compute units. With number=1, you can use up to 480 cores and be sure of the most efficient communication pattern. With number=2, you can go up to 960 cores; in that case, any one node can communicate with the 23 other nodes on its own switch in one hop and with the 24 nodes on the other switch in three hops.
Again, see the link above for details.
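As a sketch (the 480-core count and span[ptile=20] setting are just examples carried over from above, not requirements), either setting can be added to a job's resource requirement string, for instance in the #BSUB lines of a job script:
# prefer the least-utilized switches for a 480-core, 24-node job
#BSUB -n 480
#BSUB -R "span[ptile=20] cu[pref=maxavail]"
# or: guarantee the same job lands on a single compute unit (one switch)
#BSUB -n 480
#BSUB -R "span[ptile=20] cu[maxcus=1]"
The same strings can also be given on the command line, e.g. bsub -R "span[ptile=20] cu[maxcus=1]" along with your other options.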
Note that you can also combine settings. For example,
-R "cu[pref=maxavail:maxcus=3]"
would assign your job to the three emptiest switches. There are too many option combinations to document here. Just keep in mind that using compute units to minimize communication costs can have a significant impact.
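As a final sketch (the job name, core count, walltime, and MPI launch line are again illustrative placeholders), a 960-core job limited to at most three compute units, with a preference for the emptiest switches, might be submitted with a script like:
# hypothetical 960-core job: 20 cores per node, at most 3 compute units,
# preferring the least-utilized switches
#BSUB -J wide_locality_example
#BSUB -n 960
#BSUB -R "span[ptile=20] cu[pref=maxavail:maxcus=3]"
#BSUB -W 4:00
#BSUB -o out.%J
mpirun ./my_mpi_app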