
Hardware

ACES: A Dell x86 HPC Cluster

System Name:

ACES

Operating System:

Red Hat Enterprise Linux 8

Total Compute Cores/Nodes:

11,888 cores
130 nodes

Compute Nodes:

110 Intel Sapphire Rapids Nodes
17 Intel Ice Lake Nodes
1 AMD Rome Graphcore Node with 16 Mk2 Colossus GC200 IPUs
1 Intel Ice Lake Graphcore Node with 16 Bow-2000 IPUs
1 Intel Cascade Lake Node with 8 NEC Vector Engine Type 20B-P cards

Composable Components:

30 NVIDIA H100 GPUs
4 NVIDIA A30 GPUs
3 Bittware Agilex FPGAs
2 Intel D5005 FPGAs
2 NextSilicon coprocessors
48 Intel Optane SSDs
120 Intel Max 1100 GPUs

Interconnect:

NVIDIA Mellanox NDR200 InfiniBand (MPI and storage)
Liqid PCIe Gen4 Fabrics (composability)
Liqid PCIe Gen5 Fabrics (composability)

Peak Performance:

1.6 PFLOPS

Global Disk:

2.3 PB (usable) via a DDN Lustre appliance

File System:

Lustre

Batch Facility:

Slurm by SchedMD

Location:

West Campus Data Center

Production Date:

July 2023

ACES is a Dell cluster with a rich accelerator testbed consisting of Intel Max GPUs (Graphics Processing Units), Intel FPGAs (Field Programmable Gate Arrays), NVIDIA H100 GPUs, NEC Vector Engines, NextSilicon co-processors, and Graphcore IPUs (Intelligence Processing Units). The ACES cluster consists of compute nodes using a mix of the processors detailed in Table 2 (Compute Nodes) below.

The compute nodes are interconnected with NVIDIA NDR200 InfiniBand links for MPI traffic and access to the Lustre storage. The Intel Optane SSDs and all accelerators (except the Graphcore IPUs and NEC Vector Engines) are accessed through Liqid's composable framework over PCIe (Peripheral Component Interconnect Express) Gen4 and Gen5 fabrics.
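
These composed accelerators are reached through regular Slurm batch jobs. The sketch below is a minimal, hypothetical example of generating and submitting such a job from Python; the GRES request, resource sizes, and omitted partition/account settings are illustrative placeholders, not ACES's actual Slurm configuration.

```python
import subprocess
import textwrap

# Hypothetical job script requesting one composed GPU. The GRES string and
# resource sizes are placeholders, not the actual ACES Slurm configuration.
job_script = textwrap.dedent("""\
    #!/bin/bash
    #SBATCH --job-name=gpu-test
    #SBATCH --nodes=1
    #SBATCH --ntasks-per-node=1
    #SBATCH --cpus-per-task=8
    #SBATCH --mem=32G
    #SBATCH --time=00:30:00
    #SBATCH --gres=gpu:1

    nvidia-smi
""")

# sbatch accepts a job script on stdin when no filename is given.
result = subprocess.run(["sbatch"], input=job_script, text=True,
                        capture_output=True, check=True)
print(result.stdout.strip())  # e.g. "Submitted batch job 123456"
```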

Compute Nodes

Table 2: Details of Compute Nodes

| | Sapphire Rapids Nodes | Ice Lake Nodes | Cascade Lake Node + NEC VEs | AMD Rome 7742 Node |
|---|---|---|---|---|
| Processor Type | Intel Xeon 8468 | Intel Xeon 8352Y | Intel Xeon 8268 | AMD EPYC Rome 7742 |
| Sockets per Node | 2 | 2 | 2 | 2 |
| Cores per Socket | 48 | 32 | 24 | 64 |
| Cores per Node | 96 | 64 | 48 | 128 |
| Clock Rate (Base/Turbo) | 2.10 GHz / 3.80 GHz | 2.20 GHz / 3.40 GHz | 2.90 GHz / 3.90 GHz | 2.25 GHz / 3.40 GHz |
| Memory | 512 GB DDR5-4800 | 256 GB DDR4-3200 | 768 GB DDR4-3200 | 768 GB DDR4-3200 |
| Cache | 105 MB | 48 MB | 35.75 MB | 256 MB |
| Local Disk Space | 1.6 TB NVMe (/tmp) | 3.84 TB NVMe (/tmp) | 480 GB SATA | 3.5 TB NVMe (/localdata) |

System Interconnect

The ACES compute nodes are interconnected with NDR200 (200 Gb/s) links. The leaf and core switches are interconnected with NDR (400 Gb/s) links in a fat-tree topology with a 2:1 oversubscription factor.
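
As a rough illustration of what a 2:1 oversubscription factor means, the sketch below compares aggregate downstream and upstream bandwidth at a leaf switch. The port counts are assumed for the sake of the example and are not the actual ACES switch configuration.

```python
# Illustrative fat-tree oversubscription calculation.
# Port counts below are assumptions for the example, not ACES's real switches.
node_links = 32        # assumed NDR200 links down to compute nodes
node_link_gbps = 200   # NDR200 = 200 Gb/s per link
uplinks = 8            # assumed NDR uplinks to core switches
uplink_gbps = 400      # NDR = 400 Gb/s per link

downstream = node_links * node_link_gbps  # 6400 Gb/s toward the nodes
upstream = uplinks * uplink_gbps          # 3200 Gb/s toward the core

print(f"Oversubscription factor: {downstream / upstream:.0f}:1")  # -> 2:1
```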

Figure: ACES system interconnect diagram (aces_all.png)

Data Transfer Nodes

ACES has two 100 Gb/s data transfer nodes that can be used to transfer data to ACES via the Globus web interface or the Globus command line interface. Globus Connect Server v5.4 is installed on the data transfer nodes. One data transfer node is dedicated to ACCESS users; its collection is listed as “ACCESS TAMU ACES DTN”.
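
For scripted transfers, the Globus Python SDK can submit a transfer task targeting the DTN collection. The sketch below is a minimal example assuming you already hold a valid Globus transfer access token and know the collection UUIDs; the token, UUIDs, and paths shown are placeholders, not real values for ACES.

```python
import globus_sdk

# Placeholders: supply a real transfer access token and the UUIDs of your
# source collection and the ACES DTN collection (look them up in the Globus
# web app). None of these values are real.
TRANSFER_TOKEN = "REPLACE-WITH-ACCESS-TOKEN"
SOURCE_COLLECTION = "SOURCE-COLLECTION-UUID"
ACES_DTN_COLLECTION = "ACES-DTN-COLLECTION-UUID"

tc = globus_sdk.TransferClient(
    authorizer=globus_sdk.AccessTokenAuthorizer(TRANSFER_TOKEN)
)

# Describe a one-way, recursive transfer from the source collection to ACES.
tdata = globus_sdk.TransferData(
    tc,
    source_endpoint=SOURCE_COLLECTION,
    destination_endpoint=ACES_DTN_COLLECTION,
    label="copy dataset to ACES",
)
tdata.add_item("/path/on/source/dataset/", "/path/on/aces/dataset/",
               recursive=True)

task = tc.submit_transfer(tdata)
print("Submitted Globus task:", task["task_id"])
```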

Usable Memory for Batch Jobs

While most ACES compute nodes have 512 GB of RAM, some of this memory is reserved for the node's operating system and system software. In most cases, excessive memory requests will be automatically rejected by Slurm.

ACES nodes have a memory limit of 488 GB per node for batch jobs.
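
When sizing per-core memory requests, a quick back-of-the-envelope calculation against that limit can help; the example below simply divides the 488 GB batch limit across the 96 cores of a standard node.

```python
# Per-core share of the 488 GB batch-job memory limit on a 96-core node.
node_mem_gb = 488
cores_per_node = 96
print(f"{node_mem_gb / cores_per_node:.1f} GB per core")  # ~5.1 GB per core
```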

Login Nodes

Table 3: Details of Login Nodes

| | Login Nodes |
|---|---|
| Access | Login at portal-aces.hprc.tamu.edu |
| Processor Type | Intel Xeon 8468 (Sapphire Rapids) |
| Memory | 512 GB DDR5-4800 |
| Total Nodes | 2 |
| Cores/Node | 96 |
| Interconnect | NVIDIA Mellanox NDR200 InfiniBand |
| Local Disk Space | 1.6 TB NVMe (/tmp) |