ACES: Introduction to HPC and AI for Faculty and Researchers

Overview

Instructor(s): Dr. Zhenhua He and Dr. Dinesh S. Devarajan

Time: Tuesday, Oct 7, 2025, 1:30 PM - 4:00 PM CT

Location: Online using Zoom

Prerequisite(s): Active ACCESS ID and basic Python programming skills; familiarity with PyTorch is preferred but not required

This short course will cover the basic concepts and fundamentals of High Performance Computing (HPC) and Artificial Intelligence (AI), and why HPC is important for AI. Participants will learn to use the scikit-learn and PyTorch libraries to build, train, and evaluate machine learning and deep learning models in JupyterLab on an HPC cluster, ACES. We will also cover distributed training strategies, with a focus on PyTorch Distributed Data Parallel (DDP). Through hands-on exercises, we will progress step by step: starting from CPU-based training, moving to a single GPU, scaling up to multiple GPUs on a single node, and finally extending to multi-node distributed training.
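As a taste of the kind of exercise covered in the course, the sketch below shows a minimal scikit-learn workflow: load a dataset, train a model, and evaluate it on held-out data. The dataset, model, and hyperparameters here are illustrative placeholder choices, not the course's actual exercises.

```python
# Illustrative only: a minimal build/train/evaluate loop with scikit-learn.
# Dataset and model are placeholder choices for demonstration.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load a small built-in dataset and hold out a test split
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Build and train a simple classifier
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Evaluate on the held-out data
acc = accuracy_score(y_test, model.predict(X_test))
print(f"test accuracy: {acc:.2f}")
```

The same fit/predict/score pattern carries over to most scikit-learn estimators, which is why the course uses it as a starting point before moving to PyTorch.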

Course Materials

Presentation Slides

  • ACES: Introduction to HPC and AI for Faculty and Researchers (Fall 2025): PDF

Learning Objectives

After this short course, participants will be able to:

  • Explain the concepts and fundamentals of HPC and describe its importance for AI.
  • Use the scikit-learn and PyTorch frameworks to build, train, and evaluate machine learning and deep learning models in JupyterLab on the ACES cluster.
  • Describe and compare different distributed training strategies for deep learning.
  • Transition deep learning workloads from CPU to GPU training and scale from a single GPU to multiple GPUs within a single node.
  • Extend distributed training to multi-node HPC environments.
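To illustrate the DDP pattern named in the objectives above, here is a hedged, single-process sketch (world_size=1, CPU "gloo" backend, toy model and data). On a real cluster one would launch multiple processes with torchrun, use the "nccl" backend with `device_ids` on GPUs, and take rank/world size from the launcher; the MASTER_ADDR/MASTER_PORT values below are placeholders for local testing.

```python
# Minimal single-process DDP sketch: shows the wrapping pattern only.
# Assumptions: CPU-only, gloo backend, world_size=1, placeholder rendezvous
# address/port. Real multi-node runs are launched via torchrun.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Rendezvous info normally supplied by the launcher (placeholder values here)
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29501")
dist.init_process_group("gloo", rank=0, world_size=1)

model = torch.nn.Linear(8, 2)
ddp_model = DDP(model)  # gradients are all-reduced across ranks on backward()

optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.1)
loss_fn = torch.nn.CrossEntropyLoss()

# Toy batch; in practice a DistributedSampler shards the dataset per rank
x = torch.randn(16, 8)
y = torch.randint(0, 2, (16,))

for _ in range(3):  # a few toy training steps
    optimizer.zero_grad()
    loss = loss_fn(ddp_model(x), y)
    loss.backward()
    optimizer.step()

dist.destroy_process_group()
print(f"DDP training step completed, final loss {loss.item():.3f}")
```

The key design point DDP relies on is that each rank holds a full model replica and only gradients are synchronized, which is what makes the same training loop scale from one GPU to multiple nodes.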