Latest revision as of 15:15, 5 April 2021
Terra: A Lenovo x86 HPC Cluster
Hardware Overview
| System Name: | Terra |
|---|---|
| Host Name: | terra.tamu.edu |
| Operating System: | Linux (CentOS 7) |
| Total Compute Cores/Nodes: | 9,632 cores<br>320 nodes |
| Compute Nodes: | 256 28-core compute nodes, each with 64GB RAM<br>48 28-core GPU nodes, each with one dual-GPU Tesla K80 accelerator and 128GB RAM<br>8 68-core KNL nodes with 96GB RAM<br>8 72-core KNL nodes with 96GB RAM |
| Interconnect: | Intel Omni-Path Fabric 100 Series switches |
| Peak Performance: | 377 TFLOPs |
| Global Disk: | 7PB (raw) via Lenovo's DSS-G260 appliance for general use<br>1PB (raw) via Lenovo's GSS24 purchased by and dedicated for GEOSAT |
| File System: | General Parallel File System (GPFS) |
| Batch Facility: | Slurm by SchedMD |
| Location: | Teague Data Center |
| Production Date: | February 2017 |
Terra is an Intel x86-64 Linux cluster with 320 compute nodes (9,632 total cores) and 3 login nodes. There are 256 compute nodes with 64 GB of memory, and 48 compute nodes with 128 GB of memory and a K80 GPU card. Each of these 304 compute nodes is a dual-socket server with two Intel Xeon E5-2680 v4 2.40GHz 14-core processors, commonly known as Broadwell. There are also 16 Intel Knights Landing (KNL) nodes with 96 GB of memory and either 68 or 72 cores per node.
The interconnecting fabric is a two-level fat-tree based on Intel Omni-Path Architecture (OPA). High-performance mass storage with 7.4 petabytes (raw) of capacity is made available to all nodes by one Lenovo DSS-G260 storage appliance (added in Fall 2019) and one GSS24 storage appliance (from Fall 2017).
For details on using this system, see the User Guide for Terra.
Compute Nodes
A description of each type of compute node is below:
| | General 64GB Compute | GPU 128 GB Compute | KNL 96 GB (68 core) Compute | KNL 96 GB (72 core) Compute | V100 GPU 192 GB Compute |
|---|---|---|---|---|---|
| Total Nodes | 256 | 48 | 8 | 8 | 4 |
| Processor Type | Intel Xeon E5-2680 v4 (Broadwell), 2.40GHz, 14-core | Intel Xeon E5-2680 v4 (Broadwell), 2.40GHz, 14-core | Intel Xeon Phi CPU 7250 (Knights Landing), 1.40GHz, 68-core | Intel Xeon Phi CPU 7290 (Knights Landing), 1.50GHz, 72-core | Intel Xeon Gold 5118 (Skylake), 2.30GHz, 12-core |
| Sockets/Node | 2 | 2 | 2 | 2 | 2 |
| Cores/Node | 28 | 28 | 68 | 72 | 24 |
| Memory/Node | 64 GB DDR4, 2400 MHz | 128 GB DDR4, 2400 MHz | 96 GB DDR4, 2400 MHz | 96 GB DDR4, 2400 MHz | 192 GB, 2400 MHz |
| Accelerator(s) | N/A | 1 NVIDIA K80 Accelerator | N/A | N/A | 2 NVIDIA 32GB V100 GPUs |
| Interconnect | Intel Omni-Path Architecture (OPA) | Intel Omni-Path Architecture (OPA) | Intel Omni-Path Architecture (OPA) | Intel Omni-Path Architecture (OPA) | Intel Omni-Path Architecture (OPA) |
| Local Disk Space | 1TB 7.2K RPM SATA disk | 1TB 7.2K RPM SATA disk | 220GB | 220GB | 300GB |
Note: each K80 accelerator contains two GPUs.
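As an illustrative sketch of requesting one of these GPU nodes, a Slurm batch script might look like the following. Note that the partition name `gpu` is an assumption here (verify with `sinfo` on Terra); the `--gres` directive is standard Slurm syntax for generic resources.

```shell
#!/bin/bash
#SBATCH --job-name=gpu-example   # arbitrary job name
#SBATCH --partition=gpu          # assumed partition name; check sinfo
#SBATCH --nodes=1                # one 128 GB GPU node
#SBATCH --ntasks=1
#SBATCH --gres=gpu:1             # one of the K80's two GPUs
#SBATCH --mem=8192M              # well under the usable memory on a 128 GB node
#SBATCH --time=00:10:00

nvidia-smi                       # report the allocated GPU
```

This is a configuration fragment, not a definitive recipe; consult the Terra batch documentation for the exact partition and account settings required.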
Usable Memory for Batch Jobs
While nodes on Terra have 64 GB, 96 GB, or 128 GB of RAM, some of this memory is used to maintain the software and operating system of the node. In most cases, excessive memory requests will be automatically rejected by SLURM.
The table below lists the approximate memory limits for each node type and our suggestions for requesting memory.
| | 64GB Nodes | 128GB Nodes | 96GB KNL Nodes (68 core) | 96GB KNL Nodes (72 core) |
|---|---|---|---|---|
| Node Count | 256 | 48 | 8 | 8 |
| Number of Cores | 28 Cores (2 sockets x 14 core) | 28 Cores (2 sockets x 14 core) | 68 Cores | 72 Cores |
| Memory Limit Per Core | 2048 MB (2 GB) | 4096 MB (4 GB) | 1300 MB (1.25 GB) | 1236 MB (1.20 GB) |
| Memory Limit Per Node | 57344 MB (56 GB) | 114688 MB (112 GB) | 89000 MB (84 GB) | 89000 MB (84 GB) |
SLURM may queue your job for an excessive time (or indefinitely) if it must wait for particular nodes with sufficient memory to become free.
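The per-core limits above follow from dividing the per-node limit by the core count (rounded down to whole megabytes). A quick shell check, using the values from the table:

```shell
#!/bin/sh
# Per-core memory limit = per-node limit / cores per node (integer MB).
echo $((57344 / 28))    # 64 GB nodes:  2048 MB per core
echo $((114688 / 28))   # 128 GB nodes: 4096 MB per core
echo $((89000 / 72))    # 72-core KNL:  1236 MB per core
```

A job should request no more than these amounts, e.g. `#SBATCH --mem-per-cpu=2048M` for a full 64 GB node.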
Login Nodes
The terra.tamu.edu hostname can be used to access the Terra cluster. It resolves to one of the three login nodes, terra[1-3].tamu.edu. To access a specific login node, use its corresponding hostname (e.g., terra2.tamu.edu). All login nodes have 10 GbE connections to the TAMU campus network and direct access to all global parallel (GPFS-based) file systems. The table below provides more details about the hardware configuration of the login nodes.
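Access is via SSH; for example (where `NetID` is a placeholder for your own credentials):

```shell
# Connect to the login pool; the hostname resolves to one of terra[1-3]:
ssh NetID@terra.tamu.edu

# Or target a specific login node, e.g. the K80-equipped terra3:
ssh NetID@terra3.tamu.edu
```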
| | No Accelerator | Two NVIDIA K80 Accelerators |
|---|---|---|
| HostNames | terra1.tamu.edu<br>terra2.tamu.edu | terra3.tamu.edu |
| Processor Type | Intel Xeon E5-2680 v4 2.40GHz 14-core | Intel Xeon E5-2680 v4 2.40GHz 14-core |
| Memory | 128 GB DDR4 2400 MHz | 128 GB DDR4 2400 MHz |
| Total Nodes | 2 | 1 |
| Cores/Node | 28 | 28 |
| Interconnect | Intel Omni-Path Architecture (OPA) | Intel Omni-Path Architecture (OPA) |
| Local Disk Space | per node: two 900GB 10K RPM SAS drives | per node: two 900GB 10K RPM SAS drives |
Mass Storage
Mass storage comprises 7PB (raw) via one Lenovo DSS-G260 appliance for general use and 1PB (raw) via a Lenovo GSS24 appliance purchased by and dedicated to GEOSAT. All storage appliances use IBM Spectrum Scale (formerly GPFS).
Interconnect
Namesake
"terra" comes from the Latin word for "this planet" a.k.a. "Earth". One of the purposes of this cluster is to study images gathered from a "Earth Observation Satellite" (EOS). Given that we just retired a cluster named "Eos" (named after the Greek goddess of the dawn waiting to spread the light of knowledge each day), the name terra was chosen instead