Intel PVC GPUs
Introduction
The ACES cluster has a total of 120 Intel Data Center GPU Max 1100 GPUs. Throughout this documentation, these GPUs are referred to as the Intel PVC GPUs.
Accessing Intel PVC GPUs
Interactive
Access a compute node interactively using srun with the resource options --partition=pvc and --gres=gpu:pvc:<num_gpus>.
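For example, the following requests an interactive session with one PVC GPU for one hour (the time, memory, and task counts are illustrative; adjust them to your needs):
srun --partition=pvc --gres=gpu:pvc:1 --nodes=1 --ntasks=1 --mem=100GB --time=01:00:00 --pty bash -i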
Load all the necessary modules
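For example, using the same module versions as the job script later in this guide (run module avail to see the versions currently installed):
module purge
module load intel/2023.07
module load intel/AIKit/2023.2.0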
The intel/AIKit module comes with a conda binary and a few default conda environments in a shared space.
To see all the default environments, use the following command:
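conda env list
The shared environments, such as aikit-pt-gpu and aikit-tf-gpu, should appear in this list.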
For PyTorch, create a new environment by cloning the shared aikit-pt-gpu environment:
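conda create -n aikit-pt-gpu-clone --clone aikit-pt-gpu
(The clone name aikit-pt-gpu-clone simply mirrors the naming used in the Job Submission section below; any environment name works.)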
For TensorFlow, create a new environment by cloning the shared aikit-tf-gpu environment:
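conda create -n aikit-tf-gpu-clone --clone aikit-tf-gpu
(This is the same clone command used in the job script below.)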
Activate the conda environment:
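source activate aikit-tf-gpu-clone
(Use aikit-pt-gpu-clone instead for the PyTorch environment; source activate is the form used in the job script below.)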
Install any additional packages if required:
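pip install <package-name>
(The package name is a placeholder; install whatever your workflow needs into the cloned environment.)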
Run the Python script:
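python <name-of-script>.py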
Job Submission
Intel PVCs can be accessed via Slurm with the resource options --partition=pvc and --gres=gpu:pvc:<num_gpus>.
The following is an example of a job script for the TensorFlow environment. For PyTorch, replace aikit-tf-gpu with aikit-pt-gpu and aikit-tf-gpu-clone with aikit-pt-gpu-clone.
#!/bin/bash
##NECESSARY JOB SPECIFICATIONS
#SBATCH --job-name=tf_demo
#SBATCH --time=01:00:00
#SBATCH --nodes=1
#SBATCH --output=tf_demo.%j
#SBATCH --mem=100GB
#SBATCH --gres=gpu:pvc:1
#SBATCH --partition=pvc
# load all the necessary modules
module purge
module load intel/2023.07
module load intel/AIKit/2023.2.0
ENV_NAME=aikit-tf-gpu-clone
# If it doesn't exist, create the environment
if ! conda env list | grep -q "$ENV_NAME"; then
    conda create -n "$ENV_NAME" --clone aikit-tf-gpu -y   # -y skips the confirmation prompt (needed in non-interactive batch jobs)
fi
# activate the conda environment
source activate $ENV_NAME
# change directory to your script
cd $SCRATCH/<path-to-your-script>
# executable command
python <name-of-script>.py
Monitor utilization
Launch a VNC interactive job through the portal. Make sure to select Intel GPU Max (PVC) as the node type.
Take note of the compute node that is assigned to your job.
Follow the Slurm guide to create a job script. Use the same compute node that was assigned earlier via the --nodelist option. For example, if the compute node is ac026, then the job script would list the following resources:
#SBATCH --job-name=tf_demo
#SBATCH --time=01:00:00
#SBATCH --nodes=1
#SBATCH --output=tf_demo.%j
#SBATCH --mem=100GB
#SBATCH --gres=gpu:pvc:1
#SBATCH --partition=pvc
#SBATCH --nodelist=ac026 # use the same compute node here
Submit the job using the sbatch command.
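For example (the job script filename is a placeholder):
sbatch <name-of-jobscript>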
There are two commands to monitor GPU utilization, which are detailed in the following sections.
sysmon
$ sysmon -h
Usage: ./sysmon [options]
Options:
--processes [-p] Print short device information and running processes (default)
--list [-l] Print list of devices and subdevices
--details [-d] Print detailed information for all of the devices and subdevices
--help [-h] Print help message
--version Print version
This utility provides basic information about the GPUs on a node, including the list of processes currently attached to each GPU.
The process mode (the default) prints short information about all the available GPUs and their running processes. Example output of process mode:
$ sysmon
=====================================================================================
GPU 0: Intel(R) Data Center GPU Max 1100 PCI Bus: 0000:1b:00.0
Vendor: Intel(R) Corporation Driver Version: 1.3.26516 Subdevices: 0
EU Count: 448 Threads Per EU: 8 EU SIMD Width: 16 Total Memory(MB): 46679.2
Core Frequency(MHz): 1400.0 of 1550.0 Core Temperature(C): unknown
=====================================================================================
Running Processes: 4
PID, Device Memory Used(MB), Shared Memory Used(MB), GPU Engines, Executable
4809, 2.2, 0.0, COMPUTE, /usr/bin/xpumd
2651639, 5.2, 0.0, COMPUTE;DMA, python
2651661, 46213.8, 0.0, COMPUTE;DMA, python
2652076, 2.2, 0.0, UNKNOWN, sysmon
=====================================================================================
GPU 1: Intel(R) Data Center GPU Max 1100 PCI Bus: 0000:21:00.0
Vendor: Intel(R) Corporation Driver Version: 1.3.26516 Subdevices: 0
EU Count: 448 Threads Per EU: 8 EU SIMD Width: 16 Total Memory(MB): 46679.2
Core Frequency(MHz): 200.0 of 1550.0 Core Temperature(C): unknown
=====================================================================================
Running Processes: 4
PID, Device Memory Used(MB), Shared Memory Used(MB), GPU Engines, Executable
4809, 2.2, 0.0, COMPUTE, /usr/bin/xpumd
2651639, 0.4, 0.0, COMPUTE, python
2651661, 6.9, 0.0, COMPUTE, python
2652076, 2.2, 0.0, UNKNOWN, sysmon
=====================================================================================
GPU 2: Intel(R) Data Center GPU Max 1100 PCI Bus: 0000:29:00.0
Vendor: Intel(R) Corporation Driver Version: 1.3.26516 Subdevices: 0
EU Count: 448 Threads Per EU: 8 EU SIMD Width: 16 Total Memory(MB): 46679.2
Core Frequency(MHz): 200.0 of 1550.0 Core Temperature(C): unknown
=====================================================================================
Running Processes: 4
PID, Device Memory Used(MB), Shared Memory Used(MB), GPU Engines, Executable
4809, 2.2, 0.0, COMPUTE, /usr/bin/xpumd
2651639, 0.4, 0.0, COMPUTE, python
2651661, 6.9, 0.0, COMPUTE, python
2652076, 2.2, 0.0, UNKNOWN, sysmon
=====================================================================================
GPU 3: Intel(R) Data Center GPU Max 1100 PCI Bus: 0000:2d:00.0
Vendor: Intel(R) Corporation Driver Version: 1.3.26516 Subdevices: 0
EU Count: 448 Threads Per EU: 8 EU SIMD Width: 16 Total Memory(MB): 46679.2
Core Frequency(MHz): 200.0 of 1550.0 Core Temperature(C): unknown
=====================================================================================
Running Processes: 4
PID, Device Memory Used(MB), Shared Memory Used(MB), GPU Engines, Executable
4809, 2.2, 0.0, COMPUTE, /usr/bin/xpumd
2651639, 0.4, 0.0, COMPUTE, python
2651661, 4.4, 0.0, COMPUTE, python
2652076, 2.2, 0.0, UNKNOWN, sysmon
To monitor the usage periodically, pair it with Linux's watch utility:
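watch -n 5 sysmon
The 5-second refresh interval is just an example; adjust the -n value as desired.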
xpumcli
See the XPU Manager CLI help info:
$ xpumcli -h
Intel XPU Manager Command Line Interface -- v1.2
Intel XPU Manager Command Line Interface provides the Intel data center GPU model and monitoring capabilities. It can also be used to change the Intel data center GPU settings and update the firmware.
Intel XPU Manager is based on Intel oneAPI Level Zero. Before using Intel XPU Manager, the GPU driver and Intel oneAPI Level Zero should be installed rightly.
Supported devices:
- Intel Data Center GPU
Usage: xpumcli [Options]
xpumcli -v
xpumcli -h
xpumcli discovery
Options:
-h,--help Print this help message and exit
-v,--version Display version information and exit.
Subcommands:
discovery Discover the GPU devices installed on this machine and provide the device info.
topology Get the system topology.
group Group the managed GPU devices.
diag Run some test suites to diagnose GPU.
health Get the GPU device component health status.
policy Get and set the GPU policies.
updatefw Update GPU firmware
config Get and change the GPU settings.
topdown Expected feature.
ps List status of processes.
vgpu Create and remove virtual GPUs in SRIOV configuration.
stats List the GPU aggregated statistics since last execution of this command or XPU Manager daemon is started.
dump Dump device statistics data.
log Collect GPU debug logs.
agentset Get or change some XPU Manager settings.
amcsensor List the AMC real-time sensor readings.
A full list of available metrics can be found with the dump command.
$ xpumcli dump
Dump device statistics data.
Usage: xpumcli dump [Options]
xpumcli dump -d [deviceIds] -t [deviceTileIds] -m [metricsIds] -i [timeInterval] -n [dumpTimes]
xpumcli dump --rawdata --start -d [deviceId] -t [deviceTileId] -m [metricsIds]
xpumcli dump --rawdata --list
xpumcli dump --rawdata --stop [taskId]
Options:
-h,--help Print this help message and exit
-j,--json Print result in JSON format
-d,--device The device IDs or PCI BDF addresses to query. The value of "-1" means all devices.
-t,--tile The device tile IDs to query. If the device has only one tile, this parameter should not be specified.
-m,--metrics Metrics type to collect raw data, options. Separated by the comma.
0. GPU Utilization (%), GPU active time of the elapsed time, per tile
1. GPU Power (W), per tile
2. GPU Frequency (MHz), per tile
3. GPU Core Temperature (Celsius Degree), per tile
4. GPU Memory Temperature (Celsius Degree), per tile
5. GPU Memory Utilization (%), per tile
6. GPU Memory Read (kB/s), per tile
7. GPU Memory Write (kB/s), per tile
8. GPU Energy Consumed (J), per tile
9. GPU EU Array Active (%), the normalized sum of all cycles on all EUs that were spent actively executing instructions. Per tile.
10. GPU EU Array Stall (%), the normalized sum of all cycles on all EUs during which the EUs were stalled. Per tile.
At least one thread is loaded, but the EU is stalled. Per tile.
11. GPU EU Array Idle (%), the normalized sum of all cycles on all cores when no threads were scheduled on a core. Per tile.
12. Reset Counter, per tile.
13. Programming Errors, per tile.
14. Driver Errors, per tile.
15. Cache Errors Correctable, per tile.
16. Cache Errors Uncorrectable, per tile.
17. GPU Memory Bandwidth Utilization (%)
18. GPU Memory Used (MiB)
19. PCIe Read (kB/s), per GPU
20. PCIe Write (kB/s), per GPU
21. Xe Link Throughput (kB/s), a list of tile-to-tile Xe Link throughput.
22. Compute engine utilizations (%), per tile.
23. Render engine utilizations (%), per tile.
24. Media decoder engine utilizations (%), per tile.
25. Media encoder engine utilizations (%), per tile.
26. Copy engine utilizations (%), per tile.
27. Media enhancement engine utilizations (%), per tile.
28. 3D engine utilizations (%), per tile.
29. GPU Memory Errors Correctable, per tile. Other non-compute correctable errors are also included.
30. GPU Memory Errors Uncorrectable, per tile. Other non-compute uncorrectable errors are also included.
31. Compute engine group utilization (%), per tile.
32. Render engine group utilization (%), per tile.
33. Media engine group utilization (%), per tile.
34. Copy engine group utilization (%), per tile.
35. Throttle reason, per tile.
36. Media Engine Frequency (MHz), per tile
-i The interval (in seconds) to dump the device statistics to screen. Default value: 1 second.
-n Number of the device statistics dump to screen. The dump will never be ended if this parameter is not specified.
--rawdata Dump the required raw statistics to a file in background.
--start Start a new background task to dump the raw statistics to a file. The task ID and the generated file path are returned.
--stop Stop one active dump task.
--list List all the active dump tasks.
Usage example for dump command:
$ xpumcli dump -d 0 -m 0,1,2,3,4,5
Timestamp, DeviceId, GPU Utilization (%), GPU Power (W), GPU Frequency (MHz), GPU Core Temperature (Celsius Degree), GPU Memory Temperature (Celsius Degree), GPU Memory Utilization (%)
11:01:41.000, 0, 0.00, 27.90, 0, 25.00, 21.00, 0.05
11:01:42.000, 0, 0.00, 28.87, 0, 24.50, 21.00, 0.05
11:01:43.000, 0, 0.00, 28.76, 0, 25.00, 21.00, 0.05
11:01:44.000, 0, 0.00, 28.77, 0, 24.00, 21.00, 0.05
11:01:45.000, 0, 0.00, 28.78, 0, 23.50, 21.00, 0.05
11:01:46.000, 0, 0.00, 28.84, 0, 24.00, 21.00, 0.05
11:01:47.000, 0, 0.00, 28.73, 0, 24.50, 21.00, 0.05
show_pvc_features
This script shows the current arrangement of nodes with PVCs composed over Liqid PCIe fabrics, with and without Xe Link bridges. Note that there are now PVCs in both PCIe Gen4 and Gen5 fabrics. All nodes with Xe Link bridges are in Gen5 fabrics.
$ show_pvc_features
HOSTNAME AVAIL_FEATURES GRES STATE
ac010 gen4_fabric gpu:pvc:4 mixed
ac011 gen4_fabric gpu:pvc:4 mixed
ac012 gen4_fabric gpu:pvc:4 mixed
ac013 gen4_fabric gpu:pvc:4 mixed
ac023 gen4_fabric gpu:pvc:4 idle
ac024 gen4_fabric gpu:pvc:8 idle
ac025 gen4_fabric gpu:pvc:4 mixed
ac026 gen5_fabric gpu:pvc:6 reserved
ac030 gen5_fabric gpu:pvc:8 reserved
ac034 gen5_fabric gpu:pvc:4 reserved
ac039 gen5_fabric gpu:pvc:4 reserved
ac050 gen5_nonfabric gpu:pvc:2 mixed
ac051 gen5_nonfabric gpu:pvc:2 drained*
ac062 gen4_fabric gpu:pvc:4 mixed
ac068 gen4_fabric gpu:pvc:8 idle
ac078 gen4_fabric gpu:pvc:4 mixed
ac079 gen4_fabric gpu:pvc:4 mixed
ac081 gen5_fabric,xelink4 gpu:pvc:4 reserved
ac082 gen5_fabric,xelink2 gpu:pvc:2 reserved
ac083 gen5_fabric gpu:pvc:2 reserved
ac085 gen5_fabric,xelink4 gpu:pvc:4 reserved
ac086 gen5_fabric,xelink2 gpu:pvc:2 reserved
ac087 gen5_fabric,xelink2 gpu:pvc:2 reserved
ac089 gen5_fabric,xelink4 gpu:pvc:4 reserved
ac094 gen5_fabric,xelink2 gpu:pvc:2 reserved
ac095 gen5_fabric,xelink2 gpu:pvc:2 reserved
ac097 gen5_fabric,xelink2 gpu:pvc:2 reserved
ac099 gen5_fabric,xelink4 gpu:pvc:4 reserved
ac100 gen5_fabric,xelink2 gpu:pvc:2 allocated
ac101 gen5_fabric gpu:pvc:4 reserved
ac102 gen5_fabric gpu:pvc:4 reserved
ac103 gen5_fabric,xelink2 gpu:pvc:2 idle
Use the --constraint=xelink2 or --constraint=xelink4 sbatch option to request a node with a 2-way or 4-way Xe Link bridge.
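For example, adding the following directives to a job script requests four PVCs on a node with a 4-way Xe Link bridge (per the listing above, xelink4 nodes expose gpu:pvc:4):
#SBATCH --partition=pvc
#SBATCH --gres=gpu:pvc:4
#SBATCH --constraint=xelink4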