ACES: Software for AI on HPC
Overview
Instructor(s): Dr. Zhenhua He and Richard Lawrence
Time: Tuesday, September 9, 2025 10:00AM-12:30PM CT
Location: Online using Zoom
Prerequisite(s): Active ACCESS ID, basic Linux/Unix skills
This short course provides an overview of the resources available on the ACES cluster to support AI workflows and applications. We will introduce a wide range of tools for managing software, data, and jobs. Later classes taught by HPRC will expand on individual topics.
Course Materials
Presentation Slides
The presentation slides are available as downloadable PDF files.
- ACES: Software for AI on HPC (Fall 2025): PDF
- ACES: Software for AI on HPC (Spring 2025): PDF
Learning Objectives and Agenda
In this course, participants will:
- Understand the role of HPC in AI workflows
- Learn to set up and manage HPC environments for AI
- Understand software for resource management and AI workload distribution
- Learn ways to optimize AI performance on HPC clusters
Topics covered in this course include:
- Introduction to HPC for AI
  - What is HPC, and why is it important for AI?
  - Overview of HPC resources
- Environment Setup on HPC Clusters
  - Modules System
  - Conda/Virtual Environment
  - Containers
- Software for Efficient Resource Management and Allocation
  - SLURM
  - Drona
- AI Workload Distribution Software
- Specialized software
  - Intel oneAPI for Intel GPUs
  - Graphcore Poplar SDK
  - Hugging Face Hub
- Performance Optimization
  - System Management Interface (SMI)
    - nvidia-smi
    - xpumcli
    - sysmon
  - NVIDIA Nsight
  - TensorBoard
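As a small illustration of the kind of monitoring listed under Performance Optimization, the following is a minimal sketch (not taken from the course materials) of reading GPU utilization from nvidia-smi in Python. It assumes nvidia-smi is on the PATH of the node where it runs and uses its CSV query mode. The function names `parse_gpu_csv` and `query_gpus` and the output field names are hypothetical, chosen only for this example.

```python
import csv
import io
import shutil
import subprocess

# Fields requested from nvidia-smi's --query-gpu option.
QUERY_FIELDS = ["index", "name", "utilization.gpu", "memory.used", "memory.total"]


def _to_int(value):
    """Convert a numeric field to int; return None for values such as '[N/A]'."""
    try:
        return int(value)
    except ValueError:
        return None


def parse_gpu_csv(text):
    """Parse 'csv,noheader,nounits' output from nvidia-smi into a list of dicts."""
    rows = []
    for fields in csv.reader(io.StringIO(text), skipinitialspace=True):
        if len(fields) != len(QUERY_FIELDS):
            continue  # skip blank or malformed lines
        index, name, util, mem_used, mem_total = (f.strip() for f in fields)
        rows.append({
            "index": _to_int(index),
            "name": name,
            "util_percent": _to_int(util),
            "mem_used_mib": _to_int(mem_used),
            "mem_total_mib": _to_int(mem_total),
        })
    return rows


def query_gpus():
    """Run nvidia-smi and return parsed rows, or [] if the tool is not installed."""
    if shutil.which("nvidia-smi") is None:
        return []
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(QUERY_FIELDS),
        "--format=csv,noheader,nounits",
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return parse_gpu_csv(result.stdout)


if __name__ == "__main__":
    for gpu in query_gpus():
        print(gpu)
```

Run on a GPU node (for example, inside a batch job), this prints one dictionary per visible NVIDIA GPU. On a node without nvidia-smi it prints nothing. Intel GPUs would need a different tool, such as the xpumcli or sysmon utilities listed above.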
