ACES: Software for AI on HPC

Overview

Instructor(s): Dr. Zhenhua He and Richard Lawrence

Time: Tuesday, February 18, 2025 10:00AM-12:30PM CT

Location: Online using Zoom

Prerequisite(s): Active ACCESS ID, basic Linux/Unix skills

This short course provides an overview of the resources available on the ACES cluster to support AI workflows and applications. We will introduce a wide range of tools for managing software, data, and jobs. Later classes taught by HPRC will expand on individual topics.

Course Materials

Presentation Slides

The presentation slides are available as downloadable PDF files.

  • ACES: Software for AI on HPC (Spring 2025): PDF

Learning Objectives and Agenda

In this course, participants will:

  • Understand the role of HPC in AI workflows
  • Learn to set up and manage HPC environments for AI
  • Understand software for resource management and AI workload distribution
  • Learn ways to optimize AI performance on HPC clusters

This course covers the following topics, among others:

  • Introduction to HPC for AI
    • What is HPC, and why is it important for AI?
    • Overview of HPC resources
  • Environment Setup on HPC Clusters
    • Modules System
    • Conda/Virtual Environment
    • Containers
  • Software for Efficient Resource Management and Allocation
    • SLURM
    • Drona
    • AI Workload Distribution Software
    • Specialized software
      • Intel oneAPI for Intel GPUs
      • Graphcore Poplar SDK
    • Hugging Face Hub
  • Performance Optimization
    • System Management Interface (SMI)
      • nvidia-smi
      • xpumcli
      • sysmon
    • NVIDIA Nsight
    • TensorBoard