Hprc banner tamu.png

Difference between revisions of "LSF:Advanced:Locality"

From TAMU HPRC
Jump to: navigation, search
(= Job Placement in the IB Fabric)
(Job Placement in the IB Fabric on Ada)
 
(21 intermediate revisions by the same user not shown)
Line 1: Line 1:
== Locality ==
+
__TOC__
 +
 
 +
== Physical Locality ==
 +
 
 +
=== Introduction ===
 +
 
 +
[https://en.wikipedia.org/wiki/List_of_unusual_units_of_measurement#Light-nanosecond Grace Hopper's attempt to explain locality]
  
 
The "physical locality" of compute nodes can have significant impact on runtime performance.  20 nodes that are "near" to each other in the cluster will usually have better performance than 20 nodes scattered across the cluster if your application uses a lot of multiprocess communication.  This applies both to the Infiniband fabric used on Ada and the 10Gbps Ethernet fabric used on curie.
 
The "physical locality" of compute nodes can have significant impact on runtime performance.  20 nodes that are "near" to each other in the cluster will usually have better performance than 20 nodes scattered across the cluster if your application uses a lot of multiprocess communication.  This applies both to the Infiniband fabric used on Ada and the 10Gbps Ethernet fabric used on curie.
  
The InfiniBand (IB) fabric(/network) that interconnects Ada's nodes is comprised of a central "core" switch, leaf switches, and the optical cables that interconnect them all together, including the nodes.  
+
The InfiniBand (IB) fabric(/network) that interconnects Ada's nodes is comprised of a central "core" switch, leaf switches, the cluster nodes, and all the optical cables that interconnect them together.
  
Nodes, up to 24, attach directly a leaf switch. So, any two cores on different nodes attached to the same leaf switch, can communicate by going across only one hop/switch. When the two communicating cores are on different nodes that are attached to different leaf switches, then the number of hops is at least three (node->leaf->core->leaf->node).
+
Nodes, up to 24, attach directly to a leaf switch. So, any two cores on different nodes attached to the same leaf switch, can communicate by going across only one hop/switch. When the two communicating cores are on different nodes that are attached to different leaf switches, then the number of hops increases by at least two (e.g. node->leaf->core->leaf->node).
  
 
The manner by which LSF places a job across nodes on the IB fabric is based on its own internal algorithm.
 
The manner by which LSF places a job across nodes on the IB fabric is based on its own internal algorithm.
Sometimes, a user may want to "encourage" a different arrangement from that of LSF.
+
Sometimes, a user may want to "encourage" a different arrangement from that of LSF to take advantage of locality.
  
=== Job Placement in the IB Fabric ==  
+
=== Job Placement in the IB Fabric on Ada ===  
  
 
On Ada's NextScale nodes, you can improve performance by ensuring that the nodes selected for your job are "close" to each other. This helps to minimize latencies between nodes during communication.  
 
On Ada's NextScale nodes, you can improve performance by ensuring that the nodes selected for your job are "close" to each other. This helps to minimize latencies between nodes during communication.  
  
For the best explanation of this, see the section Control job locality using compute units in the
+
For the best explanation of this, see the section [https://hprc.tamu.edu/softwareDocs/lsf/9.1.2/lsf_admin/compute_unit_lsf.html Compute Units] in the
Administering IBM Platform LSF manual available in our local copy of the LSF documentation [HRPC - need link for our latest online copy of this]
+
[https://hprc.tamu.edu/softwareDocs/lsf/9.1.2/lsf_admin/index.htm Administering IBM Platform LSF manual]  Another good reference is the man page for bsub (i.e. "man bsub").
  
 
For Ada's NextScale nodes, we define a "Compute Unit" as all the nodes
 
For Ada's NextScale nodes, we define a "Compute Unit" as all the nodes
Line 23: Line 29:
 
number must use at least three hops for nodes on different switches (first to
 
number must use at least three hops for nodes on different switches (first to
 
the source nodes switch, then to the core switch and then finally to the
 
the source nodes switch, then to the core switch and then finally to the
destination switch). Even nodes in the same rack (each rack has three
+
destination switch). Even nodes in the same rack, but on different switches (each rack has three
 
Infiniband switches) will have to travel this distance.
 
Infiniband switches) will have to travel this distance.
  
If you are running multinode jobs and are either concerned about
+
You should consider making use of the settings for locality if you are running multinode jobs and are concerned about
 
* consistency (e.g. for benchmarking), or,
 
* consistency (e.g. for benchmarking), or,
 
* maximum efficiency
 
* maximum efficiency
you should consider making use of the settings for locality.
 
  
Be aware, however, that it may take longer before your job can be scheduled.
+
Be aware, however, that it may take longer before your job can be scheduled. If you ask for 24 nodes all on one switch, the scheduler will delay your job until that constraint can be met. If you ask for any 24 nodes, the scheduler may randomly pick nodes from 24 random switches spread all over the cluster. Although the latter may run sooner, it will be much more inefficient since every node involved must pass
If you ask for 24 nodes all on one switch, the scheduler will delay your job
 
until that constraint can be met. If you ask for any 24 nodes, the scheduler
 
may pick one node from each of 24 switches. Although the latter may run
 
sooner, it will be much more inefficient since every node involved must pass
 
 
through the core switch to talk to any other node.
 
through the core switch to talk to any other node.
  
 
For details on syntax, see the link above. In general, the following two
 
For details on syntax, see the link above. In general, the following two
 
settings may be the most useful:
 
settings may be the most useful:
Setting        Result
 
  
 
     -R "cu[pref=maxavail]"   
 
     -R "cu[pref=maxavail]"   
  
This will select nodes that are on switches that are
+
This will select nodes that are on switches that are the least utilized. This will help to group nodes together to help minimize interswitch communication. It won't be as efficient as the next setting, but should cut down the amount of time your job has to wait before starting
the least utilized. This will help to group nodes together to help minimize
 
interswitch communication. It won't be as efficient as the next setting, but
 
should cut down the amount of time your job has to wait before starting
 
  
 
     -R "cu[maxcus=number]"  
 
     -R "cu[maxcus=number]"  
  
This will guarantee that your job will utilize no more
+
This will guarantee that your job will utilize no more than number of compute units. So, if number=1, you can use up to 480 cores and be sure of the most efficient communication pattern. With 2, you can go up to
than number of compute units. So, if number=1, you can use up to 480 cores and
+
960, which any one node can communicate with 23 nodes in only one hop and the other 24 nodes in three hops
be sure of the most efficient communication pattern. With 2, you can go up to
 
960, which any one node can communicate with 23 nodes in only one hop and the
 
other 24 nodes in three hops
 
  
Again, see the link above for details.
+
Again, see the link above and "man bsub" for details.
  
 
Note, that you can also combine settings. For example,
 
Note, that you can also combine settings. For example,
Line 63: Line 57:
 
     -R "cu[pref=maxavail:maxcus=3]"
 
     -R "cu[pref=maxavail:maxcus=3]"
  
would assign your jobs to the three emptiest switches. The myriad of
+
would assign your jobs to the three emptiest switches. The myriad of options/combinations is too much to document here. Just keep in mind that by using compute units to minimize communications costs can have a significant impact.
options/combinations is too much to document here. Just keep in mind that by
+
 
using compute units to minimize communications costs can have a significant
+
=== References ===
impact.
+
* [https://hprc.tamu.edu/softwareDocs/lsf/9.1.2/ IBM Platform LSF Documentation]
  
 
[[Category:Ada]]
 
[[Category:Ada]]

Latest revision as of 08:27, 11 December 2019

Physical Locality

Introduction

Grace Hopper's attempt to explain locality

The "physical locality" of compute nodes can have significant impact on runtime performance. 20 nodes that are "near" to each other in the cluster will usually have better performance than 20 nodes scattered across the cluster if your application uses a lot of multiprocess communication. This applies both to the Infiniband fabric used on Ada and the 10Gbps Ethernet fabric used on curie.

The InfiniBand (IB) fabric(/network) that interconnects Ada's nodes is comprised of a central "core" switch, leaf switches, the cluster nodes, and all the optical cables that interconnect them together.

Nodes, up to 24, attach directly to a leaf switch. So, any two cores on different nodes attached to the same leaf switch, can communicate by going across only one hop/switch. When the two communicating cores are on different nodes that are attached to different leaf switches, then the number of hops increases by at least two (e.g. node->leaf->core->leaf->node).

The manner by which LSF places a job across nodes on the IB fabric is based on its own internal algorithm. Sometimes, a user may want to "encourage" a different arrangement from that of LSF to take advantage of locality.

Job Placement in the IB Fabric on Ada

On Ada's NextScale nodes, you can improve performance by ensuring that the nodes selected for your job are "close" to each other. This helps to minimize latencies between nodes during communication.

For the best explanation of this, see the section Compute Units in the Administering IBM Platform LSF manual Another good reference is the man page for bsub (i.e. "man bsub").

For Ada's NextScale nodes, we define a "Compute Unit" as all the nodes connected to a single Infiniband switch. There are 24 nodes, each with 20 cores in each compute unit when means that you can run jobs up to 480 cores with only one "hop" (switch) between each node. Jobs using more than that number must use at least three hops for nodes on different switches (first to the source nodes switch, then to the core switch and then finally to the destination switch). Even nodes in the same rack, but on different switches (each rack has three Infiniband switches) will have to travel this distance.

You should consider making use of the settings for locality if you are running multinode jobs and are concerned about

  • consistency (e.g. for benchmarking), or,
  • maximum efficiency

Be aware, however, that it may take longer before your job can be scheduled. If you ask for 24 nodes all on one switch, the scheduler will delay your job until that constraint can be met. If you ask for any 24 nodes, the scheduler may randomly pick nodes from 24 random switches spread all over the cluster. Although the latter may run sooner, it will be much more inefficient since every node involved must pass through the core switch to talk to any other node.

For details on syntax, see the link above. In general, the following two settings may be the most useful:

   -R "cu[pref=maxavail]"  

This will select nodes that are on switches that are the least utilized. This will help to group nodes together to help minimize interswitch communication. It won't be as efficient as the next setting, but should cut down the amount of time your job has to wait before starting

   -R "cu[maxcus=number]" 

This will guarantee that your job will utilize no more than number of compute units. So, if number=1, you can use up to 480 cores and be sure of the most efficient communication pattern. With 2, you can go up to 960, which any one node can communicate with 23 nodes in only one hop and the other 24 nodes in three hops

Again, see the link above and "man bsub" for details.

Note, that you can also combine settings. For example,

   -R "cu[pref=maxavail:maxcus=3]"

would assign your jobs to the three emptiest switches. The myriad of options/combinations is too much to document here. Just keep in mind that by using compute units to minimize communications costs can have a significant impact.

References