ACES: Introduction to HPC and AI for Faculty and Researchers
Overview
Instructor(s): Dr. Zhenhua He and Dr. Dinesh S. Devarajan
Time: Tuesday, Oct 7, 2025 1:30PM-4:00PM CT
Location: Online using Zoom
Prerequisite(s): Active ACCESS ID; basic Python programming skills; familiarity with PyTorch is preferred but not required
This short course will cover the basic concepts and fundamentals of High Performance Computing (HPC) and Artificial Intelligence (AI), and explain why HPC is important for AI. Participants will learn to use the scikit-learn and PyTorch libraries to build, train, and evaluate machine learning and deep learning models in JupyterLab on an HPC cluster, ACES. We will also cover distributed training strategies, with a focus on PyTorch Distributed Data Parallel (DDP). Through hands-on exercises, we will progress step by step: starting from CPU-based training, moving to a single GPU, scaling up to multiple GPUs on a single node, and finally extending to multi-node distributed training.
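The CPU-to-GPU-to-DDP progression described above can be sketched in a few lines of PyTorch. This is an illustrative sketch, not the course's actual notebook: the toy model, data, and `train_step_demo` name are made up for the example. The same script runs unchanged on CPU or a single GPU; the DDP wrapping only activates when the script is launched under a process group (e.g. via `torchrun`).

```python
import torch
import torch.nn as nn

def train_step_demo():
    # Step 1 (CPU -> single GPU): pick the device once; the rest of the
    # script is device-agnostic.
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    model = nn.Linear(8, 1).to(device)  # toy model for illustration

    # Step 2 (multi-GPU / multi-node): wrap the model in DDP, but only if a
    # process group has been initialized -- e.g. when launched with
    #   torchrun --nproc_per_node=4 train.py
    if torch.distributed.is_available() and torch.distributed.is_initialized():
        model = nn.parallel.DistributedDataParallel(model)

    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    loss_fn = nn.MSELoss()
    x = torch.randn(32, 8, device=device)  # synthetic inputs
    y = torch.randn(32, 1, device=device)  # synthetic targets

    start_loss = loss_fn(model(x), y).item()
    for _ in range(50):  # a few SGD steps on the toy regression problem
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        opt.step()
    return start_loss, loss.item()
```

The point of the conditional DDP wrap is that the single-process and distributed versions share one code path, which is the pattern the hands-on exercises build toward.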
Course Materials
Presentation Slides
- ACES: Introduction to HPC and AI for Faculty and Researchers (Fall 2025): PDF
Learning Objectives
After this short course, participants will be able to:
- Explain the concepts and fundamentals of HPC and describe its importance for AI.
- Use the scikit-learn and PyTorch frameworks to build, train, and evaluate machine learning and deep learning models in JupyterLab on the ACES cluster.
- Describe and compare different distributed training strategies for deep learning.
- Transition deep learning workloads from CPU to GPU training and scale from a single GPU to multiple GPUs within a single node.
- Extend distributed training to multi-node HPC environments.
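For the scikit-learn objective above, the build/train/evaluate cycle fits in a few lines. This is a minimal illustrative sketch, not the course's material: the synthetic dataset and logistic-regression classifier are stand-ins for whatever models the exercises actually use.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Build: a synthetic binary classification dataset.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Train: fit a classifier on the training split.
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Evaluate: score predictions on the held-out split.
acc = accuracy_score(y_te, clf.predict(X_te))
```

The same three-phase structure (build the data/model, fit, evaluate on held-out data) carries over to the PyTorch deep learning exercises.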