Hardware
Grace: A Dell x86 HPC Cluster
| | |
|---|---|
| System Name | Grace |
| Host Name | grace.hprc.tamu.edu |
| Operating System | Linux (CentOS 7) |
| Total Compute Cores/Nodes | 45,376 cores on 940 nodes |
| Compute Nodes | 800 48-core compute nodes, each with 384 GB RAM |
| Interconnect | Mellanox HDR100 InfiniBand |
| Peak Performance | 6.3 PFLOPS |
| Global Disk | 5 PB (usable) via DDN Lustre appliances for general use |
| File System | Lustre and GPFS |
| Batch Facility | Slurm |
| Location | West Campus Data Center |
| Production Date | Spring 2021 |
Grace is an Intel x86-64 Linux cluster with 940 compute nodes (45,376 total cores) and 5 login nodes. There are 800 regular compute nodes with 384 GB of memory each and 132 GPU nodes, also with 384 GB of memory each. Among the 132 GPU nodes, 100 have two A100 40 GB GPU cards, 9 have two RTX 6000 24 GB GPU cards, 8 have four T4 16 GB GPU cards, and 15 have two A40 48 GB GPUs. The 800 regular compute nodes and 132 GPU nodes are dual-socket servers with two Intel Xeon 6248R 3.0 GHz 24-core processors, commonly known as Cascade Lake. In addition, there are 8 large-memory compute nodes with 3 TB of memory and four Intel Xeon 6248 2.5 GHz 20-core processors.
The interconnecting fabric is a two-level fat-tree based on HDR 100 InfiniBand.
High-performance mass storage with 5 PB of usable capacity is made available to all nodes by DDN Lustre appliances. In addition, 3.3 PB of Lenovo DSS GPFS storage is dedicated to specific research labs (see Mass Storage below).
For details on using this system, see the User Guide for Grace.
Compute Nodes
A description of the six types of compute nodes is below:
| | General 384 GB | GPU A100 | GPU RTX 6000 | GPU T4 | GPU A40 | Large Memory 3 TB |
|---|---|---|---|---|---|---|
| Total Nodes | 800 | 100 | 9 | 8 | 15 | 8 |
| Processor Type | Intel Xeon 6248R (Cascade Lake), 3.0 GHz, 24-core | Intel Xeon 6248R (Cascade Lake), 3.0 GHz, 24-core | Intel Xeon 6248R (Cascade Lake), 3.0 GHz, 24-core | Intel Xeon 6248R (Cascade Lake), 3.0 GHz, 24-core | Intel Xeon 6248R (Cascade Lake), 3.0 GHz, 24-core | Intel Xeon 6248 (Cascade Lake), 2.5 GHz, 20-core |
| Sockets/Node | 2 | 2 | 2 | 2 | 2 | 4 |
| Cores/Node | 48 | 48 | 48 | 48 | 48 | 80 |
| Memory/Node | 384 GB DDR4, 3200 MHz | 384 GB DDR4, 3200 MHz | 384 GB DDR4, 3200 MHz | 384 GB DDR4, 3200 MHz | 384 GB DDR4, 3200 MHz | 3 TB DDR4, 3200 MHz |
| Accelerator(s) | N/A | 2 NVIDIA A100 40 GB GPUs | 2 NVIDIA RTX 6000 24 GB GPUs | 4 NVIDIA T4 16 GB GPUs | 2 NVIDIA A40 48 GB GPUs | N/A |
| Interconnect | Mellanox HDR100 InfiniBand | Mellanox HDR100 InfiniBand | Mellanox HDR100 InfiniBand | Mellanox HDR100 InfiniBand | Mellanox HDR100 InfiniBand | Mellanox HDR100 InfiniBand |
| Local Disk Space | 1.6 TB NVMe (/tmp), 480 GB SSD | 1.6 TB NVMe (/tmp), 480 GB SSD | 1.6 TB NVMe (/tmp), 480 GB SSD | 1.6 TB NVMe (/tmp), 480 GB SSD | 1.6 TB NVMe (/tmp), 480 GB SSD | 1.6 TB NVMe (/tmp), 480 GB SSD |
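To target one of the GPU node types above, a job must request GPUs through the batch system. The following is a minimal sketch using only standard SLURM directives; partition names, account settings, and type-specific GPU selectors (for example, requesting A100s by name) are site-specific and are therefore omitted here, so consult the User Guide for Grace for the exact options.

```bash
#!/bin/bash
## Minimal sketch: requesting GPUs with generic SLURM directives.
## Partition, account, and GPU-type selectors are omitted because they
## are site-specific; see the User Guide for Grace for the exact names.
#SBATCH --job-name=gpu_example
#SBATCH --time=01:00:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=48   # a full 48-core GPU node
#SBATCH --gres=gpu:2           # two GPUs on that node (e.g., an A100 or A40 node)

nvidia-smi                     # confirm which GPUs the job received
```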
Usable Memory for Batch Jobs
While nodes on Grace have either 384 GB or 3 TB of RAM, some of this memory is reserved for the node's operating system and system software. In most cases, memory requests that exceed the limits below will be rejected automatically by SLURM.
The table below lists the approximate memory limits for each node type and our suggestions for requesting memory within them; an example job script follows the table.
| | 384 GB Nodes (Regular and GPU) | 3 TB Nodes |
|---|---|---|
| Node Count | 932 | 8 |
| Number of Cores | 48 cores | 80 cores |
| Memory Limit per Core | 7500 MB | 37120 MB |
| Memory Limit per Node | 368640 MB (360 GB) | 2969600 MB (2900 GB) |
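As an illustration of the limits above, a job that needs most of the memory on a regular 384 GB node could request it as follows. This is a minimal sketch; account, partition, and module settings are omitted and depend on your allocation.

```bash
#!/bin/bash
## Minimal sketch: staying within the per-node / per-core memory limits.
#SBATCH --job-name=mem_example
#SBATCH --time=02:00:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=48   # all 48 cores of a 384 GB node
#SBATCH --mem=360G             # 368640 MB, the per-node limit on 384 GB nodes

## Alternatively, request memory per core instead of per node:
## #SBATCH --mem-per-cpu=7500M # the per-core limit on 384 GB nodes

# your application here
```

A request above these values (for example, the full 384 GB on a regular node) would be rejected, since part of the physical memory is reserved for the operating system.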
Login Nodes
The grace.hprc.tamu.edu hostname can be used to access the Grace cluster. This hostname resolves to one of the five login nodes, grace[1-5].hprc.tamu.edu. To access a specific login node, use its corresponding host name (e.g., grace2.hprc.tamu.edu). All login nodes have 10 GbE connections to the TAMU campus network and direct access to all global parallel (Lustre-based) file systems. The table below provides more details about the hardware configuration of the login nodes, and a connection example follows the table.
| | NVIDIA A100 GPU | NVIDIA RTX 6000 GPU | NVIDIA T4 GPU | No GPU |
|---|---|---|---|---|
| Hostnames | grace1.hprc.tamu.edu | grace2.hprc.tamu.edu | grace3.hprc.tamu.edu | grace4.hprc.tamu.edu, grace5.hprc.tamu.edu |
| Processor Type | Intel Xeon 6248R 3.0 GHz 24-core | Intel Xeon 6248R 3.0 GHz 24-core | Intel Xeon 6248R 3.0 GHz 24-core | Intel Xeon 6248R 3.0 GHz 24-core |
| Memory | 384 GB DDR4 3200 MHz | 384 GB DDR4 3200 MHz | 384 GB DDR4 3200 MHz | 384 GB DDR4 3200 MHz |
| Total Nodes | 1 | 1 | 1 | 2 |
| Cores/Node | 48 | 48 | 48 | 48 |
| Interconnect | Mellanox HDR100 InfiniBand | Mellanox HDR100 InfiniBand | Mellanox HDR100 InfiniBand | Mellanox HDR100 InfiniBand |
| Local Disk Space (per node) | two 480 GB SSD drives, 1.6 TB NVMe | two 480 GB SSD drives, 1.6 TB NVMe | two 480 GB SSD drives, 1.6 TB NVMe | two 480 GB SSD drives, 1.6 TB NVMe |
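A typical connection looks like the following sketch (replace `netid` with your own username):

```bash
# Land on any of the five login nodes via the load-balanced hostname
ssh netid@grace.hprc.tamu.edu

# Or target a specific login node, e.g. the RTX 6000 GPU login node
ssh netid@grace2.hprc.tamu.edu
```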
Mass Storage
- 5 PB (usable) of Lustre storage provided by one DDN ES200NV appliance and two ES7990X appliances
- 1.4 PB (usable) of GPFS storage provided by Lenovo's DSS-G220 appliance
- 1.9 PB (usable) of GPFS storage provided by Lenovo's DSS-G230 appliance
Interconnect
Two-level fat-tree topology with Mellanox HDR100 (a worked bandwidth estimate follows the list):
- There are 5 core switches and 11 leaf switches.
- Each leaf switch has 2 Mellanox HDR InfiniBand (200Gb/s) uplinks to each core switch.
- There are up to 80 compute nodes attached to each leaf switch.
- Each login or compute node has a single Mellanox HDR100 InfiniBand (100Gb/s) link to a leaf switch.
- The DDN storage has 12 total HDR100 links.
- The Lenovo DSS-G220 storage (Dr. Junjie Zhang's CryoEM Lab) has 8 HDR100 links.
- The Lenovo DSS-G230 storage (Dr. Ping Chang's iHESP Lab) has 8 EDR links (100Gb/s).
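From the figures above one can estimate the worst-case oversubscription at a leaf switch. This is a derived estimate, not a published specification, and it assumes a fully populated leaf switch with 80 attached nodes:

$$
\frac{80 \times 100\ \text{Gb/s (node links)}}{5 \times 2 \times 200\ \text{Gb/s (uplinks)}} = \frac{8000\ \text{Gb/s}}{2000\ \text{Gb/s}} = 4:1
$$

Leaf switches with fewer attached nodes see proportionally less oversubscription.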
Namesake
"Grace" is named for Grace Hopper.