ACES: Software for AI on HPC

Overview

Instructor(s): Dr. Zhenhua He and Richard Lawrence

Time: Tuesday, September 9, 2025 10:00AM-12:30PM CT

Location: Online using Zoom

Prerequisite(s): Active ACCESS ID, basic Linux/Unix skills

This short course provides an overview of the resources available on the ACES cluster to support AI workflows and applications. We will introduce a wide range of tools for managing software, data, and jobs. Later HPRC classes will expand on individual topics.

A Registration button will appear here when registration has been opened.

Course Materials

Presentation Slides

The presentation slides are available as downloadable PDF files.

  • ACES: Software for AI on HPC (Fall 2025): PDF

  • ACES: Software for AI on HPC (Spring 2025): PDF

Learning Objectives and Agenda

In this course, participants will:

  • Understand the role of HPC in AI workflows
  • Learn to set up and manage HPC environments for AI
  • Understand software for resource management and AI workload distribution
  • Learn ways to optimize AI performance on HPC clusters

Topics covered in this course include:

  • Introduction to HPC for AI
    • What is HPC, and why is it important for AI?
    • Overview of HPC resources
  • Environment Setup on HPC Clusters
    • Modules System
    • Conda/Virtual Environment
    • Containers
  • Software for Efficient Resource Management and Allocation
    • SLURM
    • Drona
    • AI Workload Distribution Software
    • Specialized software
      • Intel oneAPI for Intel GPUs
      • Graphcore Poplar SDK
    • Hugging Face Hub
  • Performance Optimization
    • System Management Interface (SMI)
      • nvidia-smi
      • xpumcli
      • sysmon
    • NVIDIA Nsight
    • TensorBoard