Hardware
ACES: A Dell x86 HPC Cluster
System Name: | ACES
---|---
Operating System: | Red Hat Enterprise Linux 8
Total Compute Cores: | 11,888 cores
Compute Nodes: | 110 Intel Sapphire Rapids nodes<br>17 Intel Ice Lake nodes<br>1 AMD Rome Graphcore node with 16 Mk2 Colossus GC200 IPUs<br>1 Intel Ice Lake Graphcore node with 16 Bow-2000 IPUs<br>1 Intel Cascade Lake node with 8 NEC Vector Engine Type 20B-P cards
Composable Components: | 30 NVIDIA H100 GPUs
Interconnect: | NVIDIA Mellanox NDR200 InfiniBand (MPI and storage)
Peak Performance: | 1.6 PFLOPS
Global Disk: | 2.3 PB (usable) via a DDN Lustre appliance
File System: | Lustre
Batch Facility: | Slurm
Location: | West Campus Data Center
Production Date: | July 2023
ACES is a Dell cluster with a rich accelerator testbed consisting of Intel Max GPUs (Graphics Processing Units), Intel FPGAs (Field Programmable Gate Arrays), NVIDIA H100 GPUs, NEC Vector Engines, NextSilicon co-processors, and Graphcore IPUs (Intelligence Processing Units). The ACES cluster consists of compute nodes using a mix of the following processors:
- Intel Xeon 8468 Sapphire Rapids processors
- Intel Xeon Ice Lake 8352Y processors
- Intel Xeon Cascade Lake 8268 processors
- AMD Epyc Rome 7742 processors
The compute nodes are interconnected with NVIDIA NDR200 links for MPI traffic and for access to the Lustre storage. The Intel Optane SSDs and all accelerators (except the Graphcore IPUs and NEC Vector Engines) are accessed using Liqid's composable framework via PCIe (Peripheral Component Interconnect Express) Gen4 and Gen5 fabrics.
Compute Nodes
Table 2: Details of Compute Nodes
| | Sapphire Rapids Nodes | Ice Lake Nodes | Cascade Lake Node + NEC VEs | AMD Rome 7742 Node |
|---|---|---|---|---|
Processor Type | Intel Xeon 8468 | Intel Xeon 8352Y | Intel Xeon 8268 | AMD EPYC Rome 7742 |
Sockets per node | 2 | 2 | 2 | 2 |
Cores per socket | 48 | 32 | 24 | 64 |
Cores per Node | 96 | 64 | 48 | 128 |
Clock rate (Base/Turbo) | 2.10GHz / 3.8GHz | 2.20GHz / 3.40 GHz | 2.90GHz / 3.90GHz | 2.25GHz / 3.4GHz |
Memory | 512 GB DDR5-4800 | 256 GB DDR4-3200 | 768 GB DDR4-3200 | 768 GB DDR4-3200 |
Cache | 105 MB | 48 MB | 35.75 MB | 256 MB |
Local Disk Space | 1.6 TB NVMe (/tmp) | 3.84 TB NVMe (/tmp) | 480 GB SATA | 3.5 TB NVMe (/localdata) |
System Interconnect
The ACES compute nodes are interconnected with NDR200 links. The leaf and core switches are interconnected with NDR (400Gb) links in a fat tree topology with a 2:1 oversubscription factor.
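As a purely illustrative reading of that 2:1 figure (N below is a generic leaf-switch downlink count, not the actual ACES port layout), a leaf switch serving N compute nodes over NDR200 (200 Gb/s) downlinks would reach the core over N/4 NDR (400 Gb/s) uplinks:

$$
\frac{\text{node-facing bandwidth}}{\text{uplink bandwidth}}
= \frac{N \times 200\,\text{Gb/s}}{(N/4) \times 400\,\text{Gb/s}}
= \frac{2}{1}
$$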
Data Transfer Nodes
ACES has two 100Gb data transfer nodes that can be used to transfer data to ACES via the Globus Connect web interface or Globus command line. Globus Connect Server v5.4 is installed on the data transfer nodes. One data transfer node is dedicated to ACCESS users and its collection is listed as “ACCESS TAMU ACES DTN”.
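For the command-line route, a minimal sketch with the Globus CLI looks like the following; the endpoint UUIDs and paths are placeholders, and only the collection name comes from this page:

```bash
# Authenticate the Globus CLI with your Globus/ACCESS identity
globus login

# Look up the UUID of the ACES collection (name from this page)
globus endpoint search "ACCESS TAMU ACES DTN"

# Placeholder UUIDs and paths -- substitute your own values
SRC_EP="11111111-2222-3333-4444-555555555555"   # source collection UUID
DST_EP="66666666-7777-8888-9999-000000000000"   # "ACCESS TAMU ACES DTN" UUID
globus transfer "$SRC_EP:/path/to/dataset" "$DST_EP:/destination/path/dataset" \
  --recursive --label "Copy dataset to ACES"
```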
Usable Memory for Batch Jobs
While the standard (Sapphire Rapids) compute nodes on ACES have 512 GB of RAM, some of this memory is used to maintain the software and operating system of the node. In most cases, excessive memory requests will be automatically rejected by Slurm.
ACES nodes have a memory limit of 488 GB per node for batch jobs.
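A minimal sketch of a batch job that requests memory explicitly and stays within the 488 GB per-node limit (the job name, time limit, and executable are placeholders):

```bash
#!/bin/bash
#SBATCH --job-name=mem-demo          # placeholder job name
#SBATCH --nodes=1                    # single node
#SBATCH --ntasks-per-node=96         # all cores on a Sapphire Rapids node
#SBATCH --mem=488G                   # at or below the 488 GB per-node limit
#SBATCH --time=01:00:00              # placeholder wall-clock limit

# Placeholder executable; replace with your application
srun ./my_application
```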
Login Nodes
Login Nodes | |
---|---|
Access | Login at portal-aces.hprc.tamu.edu
Processor Type | Intel Xeon 8468 (Sapphire Rapids)
Memory | 512 GB DDR5-4800
Total Nodes | 2
Cores/Node | 96
Interconnect | NVIDIA Mellanox NDR200 InfiniBand
Local Disk Space | 1.6 TB NVMe (/tmp)