ACES: Software for AI on HPC
Overview
Instructor(s): Dr. Zhenhua He and Richard Lawrence
Time: Tuesday, February 18, 2025 10:00AM-12:30PM CT
Location: Online using Zoom
Prerequisite(s): Active ACCESS ID, basic Linux/Unix skills
This short course will provide an overview of the resources available on the ACES cluster to support AI workflows and applications. We will introduce a range of tools for managing software, data, and jobs. Later classes taught by HPRC will expand on individual topics.
Course Materials
Presentation Slides
The presentation slides are available as downloadable PDF files.
- ACES: Software for AI on HPC (Spring 2025): PDF
Learning Objectives and Agenda
In this course, participants will:
- Understand the role of HPC in AI workflows
- Learn to set up and manage HPC environments for AI
- Understand software for resource management and AI workload distribution
- Learn ways to optimize AI performance on HPC clusters
Topics covered in this course include:
- Introduction to HPC for AI
  - What is HPC, and why is it important for AI?
  - Overview of HPC resources
- Environment Setup on HPC Clusters
  - Modules System
  - Conda/Virtual Environment
  - Containers
- Software for Efficient Resource Management and Allocation
  - SLURM
  - Drona
- AI Workload Distribution Software
- Specialized Software
  - Intel oneAPI for Intel GPUs
  - Graphcore Poplar SDK
  - Hugging Face Hub
- Performance Optimization
  - System Management Interface (SMI)
    - nvidia-smi
    - xpumcli
    - sysmon
  - NVIDIA Nsight
  - TensorBoard